Web archives as research infrastructure for digital societies: the case study of Arquivo.pt

Daniel Gomes

ABSTRACT 

Humans are the dominant species on Earth. Our advantage comes from our unique capacity to organise at large scale to reach common goals. In digital societies, organising requires communicating information, and these days most of it is published exclusively online. The problem is that online information disappears quickly, within a few months. Humanity’s dependence on online information is strong but still recent, and the consequences of losing the historical perspective over online data are yet to be seen. Web archives are digital preservation systems that collect, store and provide access to historical web data. Scientific researchers already use web archives. However, web archives should also be used by the wider public so that they may serve digital societies. Arquivo.pt is a public web archive, started in 2007, that enables search and access to historical information preserved from the Web since the 1990s. This article presents Arquivo.pt as a case study of a research infrastructure that has been developed to serve wider communities at national and international levels. The article shares the main lessons learned so that other web archiving initiatives may arise and develop at a faster pace. It describes the existing tools and activities that enable exploration of historical web-archived collections. Finally, it presents challenges related to creating web archives and proposes actions to address them.

Keywords: web archiving, digital preservation, recommendations

The primary version of the journal is the electronic version published on the Internet.