Current Bachelor Thesis Topics
Bachelor Topics WS 2024/2025
1. Text Mining and Machine Learning
Supervisor: Johann Mitlöhner
Text mining aims to turn written natural language into structured data that allow types of analysis which are hard or impossible on the raw text; machine learning aims to automate this process using a variety of adaptive methods, such as artificial neural nets that learn from training data. Typical goals of text mining are classification, sentiment detection, and other types of information extraction, e.g. named entity recognition (identifying people, places, and organizations) and relation extraction (e.g. locations of organizations).
Connectionist methods, and deep learning in particular, have attracted much attention and success recently; these methods tend to work well on large training datasets, which in turn require ample computing power. Our institute has recently acquired high-performance GPU units which are available for student use in thesis projects. It is highly recommended to use a framework such as PyTorch or TensorFlow/Keras for developing your deep learning application; the changes required to go from CPU to GPU computing will then be minimal. This means you can start developing on your PC or notebook, or on the department's Jupyter notebook server, with a small subset of the training data; when you later transition to the GPU server, the added performance means that larger datasets become feasible.
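To illustrate how small the CPU-to-GPU switch typically is in PyTorch, here is a minimal, hedged sketch; the model and the random batch are placeholders for a real text classifier and dataset:

import torch
import torch.nn as nn

# Select the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder classifier standing in for a real text model.
model = nn.Sequential(nn.Linear(300, 64), nn.ReLU(), nn.Linear(64, 2)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy batch standing in for vectorized text; tensors move to the same device.
x = torch.randn(32, 300).to(device)
y = torch.randint(0, 2, (32,)).to(device)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()

Apart from the device selection and the .to(device) calls, the training code is identical on CPU and GPU.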
On text mining, see e.g.: Minqing Hu, Bing Liu: Mining and summarizing customer reviews. KDD '04: Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168-177, ACM, 2004
For a more recent overview, see e.g.: Percha B. Modern Clinical Text Mining: A Guide and Review. Annu Rev Biomed Data Sci. 2021 Jul 20;4:165-187. doi: 10.1146/annurev-biodatasci-030421-030931. Epub 2021 May 26. PMID: 34465177.
2. Visualizing Data in Virtual and Augmented Reality
Supervisor: Johann Mitlöhner
How can AR and VR be used to improve the exploration of data? Developing new methods for exploring and analyzing data in virtual and augmented reality presents many opportunities and challenges, both in terms of software development and design inspiration. There are various hardware options, from cheap but workable, such as Google Cardboard, to more sophisticated and expensive headsets. Taking part in this challenge demands programming skills as well as creativity. The student will develop a basic VR or AR application for exploring a specific type of (open) data. The use of a platform-independent kit such as A-Frame is essential, as the application will be compared in a small user study to its non-VR version in order to identify advantages and disadvantages of the implemented visualization method. Details will be discussed with the supervisor.
Some References:
Butcher, Peter WS, and Panagiotis D. Ritsos. "Building Immersive Data Visualizations for the Web." Proceedings of International Conference on Cyberworlds (CW'17), Chester, UK. 2017.
Teo, Theophilus, et al. "Data fragment: Virtual reality for viewing and querying large image sets." Virtual Reality (VR), 2017 IEEE. IEEE, 2017.
Millais, Patrick, Simon L. Jones, and Ryan Kelly. "Exploring Data in Virtual Reality: Comparisons with 2D Data Visualizations." Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 2018.
Yu Shu, Yen-Zhang Huang, Shu-Hsuan Chang, and Mu-Yen Chen. Do virtual reality head-mounted displays make a difference? a comparison of presence and self-efficacy between head-mounted displays and desktop computer-facilitated virtual environments. Virtual Reality, 23(4):437-446, 2019
3. Assessing the Global Landscape of Cyber Threats with Empirical Evidence from the United Nations
Supervisors: JMC Sturlese, Kabul Kurniawan
The database of the United Nations Office on Drugs and Crime (UNODC) provides open data on various threats to cyber security. Of particular interest to this thesis are the global datasets on unlawful access to or interference with computer systems and computer data.
The aim of this bachelor thesis is to provide a detailed account of potential threats to and vulnerabilities of cyber security (= theoretical part) and to conduct an exploratory data analysis of the dataset provided by the United Nations (= empirical part). Data analyses may be executed with R or Python. The objective of this thesis is to give an overview of the landscape of threats to cyber security and to provide detailed, plausible recommendations (ideally rooted in academic literature) for securing such vulnerabilities. In your application, please state why this topic interests you and which integrated development environment (IDE; e.g. RStudio, Jupyter Notebook, etc.) you prefer working with.
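As a minimal, hedged sketch of the empirical part in Python (the file name and column names are assumptions about the UNODC export, not the actual schema):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of the UNODC dataset; file and column names are assumptions.
df = pd.read_csv("unodc_unlawful_access.csv")  # columns: country, year, value

# Recorded offences per country and year.
by_year = df.groupby(["country", "year"])["value"].sum().unstack("year")
print(by_year.describe())

# Simple trend plot for a few example countries.
by_year.loc[["Austria", "Germany"]].T.plot(marker="o")
plt.ylabel("Recorded offences")
plt.tight_layout()
plt.savefig("unodc_trend.png")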
Sources
Basholli, F., Daberdini, A., & Basholli, A. (2023). Possibility of protection against unauthorized interference in telecommunication systems. Engineering Applications, 2(3), 265-278.
Furnell, S. M., & Warren, M. J. (1999). Computer hacking and cyber terrorism: The real threats in the new millennium?. Computers & Security, 18(1), 28-34.
Hatfield, J. M. (2019). Virtuous human hacking: The ethics of social engineering in penetration-testing. Computers & Security, 83, 354-366.
Van Daalen, O. L. (2023). The right to encryption: Privacy as preventing unlawful access. Computer Law & Security Review, 49, 105804.
Xia, H., & Brustoloni, J. (2004, May). Detecting and blocking unauthorized access in Wi-Fi networks. In International Conference on Research in Networking (pp. 795-806). Berlin, Heidelberg: Springer Berlin Heidelberg.
4. Post-Quantum Cryptographic Research in Europe and Asia
Supervisors: JMC Sturlese, Kabul Kurniawan
Post-quantum cryptography aims to develop systems secure against quantum attacks, which threaten cryptographic methods such as RSA and ECC [1]. Although ECC, used in services like eIDAS, is efficient today, both RSA and ECC are vulnerable to quantum algorithms like Shor's, raising concerns about future security [2]. Research on breaking RSA, particularly in Asia, has advanced significantly, and the region is also leading efforts to create quantum-safe alternatives such as lattice-based cryptography [3, 4].
In this bachelor thesis, you will provide a systematic overview of the mechanisms behind conventional and post-quantum cryptography (= theoretical part). By means of a bibliometric analysis (VOSviewer), you will provide an overview of the literature on post-quantum cryptography with a particular focus on the contributing institutions' places of origin (= empirical part). You will follow cross-disciplinary branches of research and identify current trends on these topics. In the second part of the thesis, you will discuss your findings and reflect on their impact for research and practice.
Sources
Mallouli, Fatma, et al. "A survey on cryptography: comparative study between RSA vs ECC algorithms, and RSA vs El-Gamal algorithms." 2019 6th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)/2019 5th IEEE International Conference on Edge Computing and Scalable Cloud (EdgeCom). IEEE, 2019.
Bernstein, D. J., & Lange, T. (2017). Post-quantum cryptography. Nature, 549(7671), 188-194.
Oder, T., Pöppelmann, T., & Güneysu, T. (2014, June). Beyond ECDSA and RSA: Lattice-based digital signatures on constrained devices. In Proceedings of the 51st Annual Design Automation Conference (pp. 1-6).
Dam, D. T., Tran, T. H., Hoang, V. P., Pham, C. K., & Hoang, T. T. (2023). A survey of post-quantum cryptography: Start of a new race. Cryptography, 7(3), 40.
5. Visualising AI System Patterns
Supervisor: Fajar J. Ekaputra
Main idea: Developing a tool for visualising AI system patterns from their ontological representation.
Background: As the number of systems that integrate symbolic and sub-symbolic artificial intelligence (AI) has increased significantly in recent years, efforts have been made to develop a standardized description for AI systems that combine Machine Learning with Semantic Web components (i.e., Semantic Web Machine Learning Systems - SWeMLS). Van Harmelen et al. proposed a boxology notation and a text-based notation to represent the main elements of SWeMLS [1]. These notations, however, are not machine-readable and need to be created manually.
To provide a machine-readable version of these notations, Ekaputra et al. have developed the Semantic Web Machine Learning Systems (SWeMLS) ontology to represent such systems [2]. The SWeMLS ontology focuses on representing the systems, system components, inputs, outputs, and workflows between these elements.
Research Problem and Questions:
While the SWeMLS ontology represents SWeMLS in a machine-readable format, it is not well suited for communicating such systems to a broader audience. Therefore, it is necessary to develop a way to generate visual representations of these patterns (e.g., in boxology notation) for a broader audience, e.g., as part of the documentation of SWeMLS.
The thesis aims to address this gap by focusing on the following research question: How can visual notations be generated from SWeMLS ontology instances?
To this end, the thesis focuses on designing and developing a visualization tool that takes a SWeMLS description (i.e., an instance of the SWeMLS ontology) as input and generates a graphical visualization of the system as output. Interested students could investigate existing frameworks for visualizing ontologies and ontology instances, e.g., [3], and related tools such as [4].
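A minimal, hedged sketch of such a pipeline in Python, using rdflib and graphviz; the instance file and the workflow property are illustrative placeholders, and the real vocabulary must be taken from the SWeMLS ontology [2]:

import rdflib
from graphviz import Digraph

# Hypothetical SWeMLS instance file; the actual data comes from the SWeMLS-KG [2].
g = rdflib.Graph().parse("swemls_instance.ttl")

# Illustrative property; the real workflow predicate is defined by the ontology.
FLOW = rdflib.URIRef("http://example.org/swemls#hasNextComponent")

dot = Digraph("pattern", node_attr={"shape": "box"})
for s, o in g.subject_objects(FLOW):
    dot.edge(g.qname(s), g.qname(o))  # boxology-style boxes and arrows
dot.render("pattern", format="png")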
Expected Tasks:
Literature study on existing visualization frameworks for ontology instances
Design and development of a visualization tool for AI system patterns
Visualization tool evaluation
Prior-Knowledge and Skills:
The student has ideally attended SBWL KM (especially K1 and K2).
Proficiency in at least one programming language (Java or Python preferred)
References:
[1] Van Harmelen, F., & Ten Teije, A. (2019). A boxology of design patterns for hybrid learning and reasoning systems. Journal of Web Engineering, 18(1-3), 97-123.
[2] Fajar J. Ekaputra et al. Describing and Organizing Semantic Web and Machine Learning Systems in the SWeMLS-KG. The Semantic Web: 20th International Conference, ESWC 2023
[3] Steffen Lohmann et al. "WebVOWL: Web-based visualization of ontologies." In Knowledge Engineering and Knowledge Management: EKAW 2014.
[4] Serge Chávez-Feria et al. "Chowlk: from UML-based ontology conceptualizations to OWL." The Semantic Web: 19th International Conference, ESWC 2022.
6. An extended analysis of user requirements for explainable smart energy systems
Supervisors: Katrin Schreiberhuber, Marta Sabou
Keywords: user requirements, explainability, smart energy systems, statistical data analysis
Context: Smart energy systems have emerged as a promising solution for optimizing energy consumption, reducing costs, and minimizing environmental impacts. These systems leverage advanced technologies such as IoT sensors, data analytics, and automation to efficiently manage energy resources. However, for the successful adoption and acceptance of these systems, it is crucial to understand the requirements and concerns of the end users, experts, and technicians who interact with them. One critical aspect that needs investigation is the importance of explainability in smart energy systems, as it directly impacts user trust and decision-making.
Problem: The research problem revolves around comprehending user requirements for smart energy systems and evaluating the significance of explainability to different types of end users, based on the results of a user survey.
Goal/expected results of the thesis.
The primary objective of this thesis is to perform a detailed analysis of a user survey that has already been conducted by a previous student. The analysis should be complemented by a literature review on the importance of user-centered explainability in smart (energy) systems. The outcomes should provide insights into how different user groups perceive explainability and how it influences their interaction with smart energy systems.
Potential Research Questions:
What are the specific needs and expectations of users when interacting with smart energy systems in real-world scenarios?
How critical is explainability in fostering user trust and acceptance of smart energy systems? Does the importance of explainability vary among different user groups?
How does a user's background affect their need for explainability or the types of explanations they prefer?
Methodology:
Literature Review: Investigate existing research on explainable systems, user-centered explanations, and the role of explainability in enhancing user acceptance and trust in smart systems.
Statistical Analysis: Conduct a comprehensive statistical analysis of the survey results to validate hypotheses related to the importance of explainability for different user groups (a minimal sketch of such a group comparison follows below).
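A minimal sketch of such a group comparison, assuming a hypothetical survey export (all column names are assumptions):

import pandas as pd
from scipy import stats

# Hypothetical survey export with one row per respondent.
survey = pd.read_csv("survey_results.csv")  # columns: user_group, explainability_rating

groups = [g["explainability_rating"].dropna()
          for _, g in survey.groupby("user_group")]

# Kruskal-Wallis test: do the user groups rate explainability differently?
h, p = stats.kruskal(*groups)
print(f"H = {h:.2f}, p = {p:.4f}")

A non-parametric test is used here because Likert-style ratings are ordinal; the actual choice of tests is part of the thesis.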
Required Skills:
Good understanding of statistical analysis methods and implementation tools (preferably R or Python)
Literature review skills, including the ability to critically analyse and synthesize existing research.
References:
O’Dwyer, Edward, Indranil Pan, Salvador Acha, and Nilay Shah. Smart Energy Systems for Sustainable Smart Cities: Current Developments, Trends and Future Directions. Applied Energy 237 (March 1, 2019): 581–97. https://doi.org/10.1016/j.apenergy.2019.01.024.
Maguire, M., Bevan, N. (2002). User Requirements Analysis. In: Hammond, J., Gross, T., Wesson, J. (eds) Usability. IFIP WCC TC13 2002. IFIP — The International Federation for Information Processing, vol 99. Springer, Boston, MA. doi.org/10.1007/978-0-387-35610-5_9
Jha, S. S., Mayer, S., & García, K. (2021, November). Poster: Towards explaining the effects of contextual influences on cyber-physical systems. In Proceedings of the 11th International Conference on the Internet of Things (pp. 203-206).
7. Enhancing Neural Networks with Ontologies and Knowledge Graphs: A Comprehensive Survey
Supervisors: Majlinda Llugiqi, Marta Sabou
Main idea: Review and analyze existing literature and methods that utilize ontologies and knowledge graphs as tools to improve the architecture, performance, and interpretability of neural networks.
Motivation: While neural networks have achieved significant success in numerous applications, their black-box nature and data-driven essence can sometimes result in less transparency and domain-specificity. Ontologies and knowledge graphs, encapsulating structured domain knowledge, can potentially address these gaps. An in-depth survey of existing methods will offer clarity on the advancements and challenges in this interdisciplinary domain.
Research Questions:
How have ontologies and knowledge graphs been historically employed to enhance neural networks?
What are the primary benefits reported in using structured knowledge to enhance neural network models?
Which specific neural network architectures or domains (e.g., NLP, Computer Vision) have most extensively adopted these methods?
What challenges and limitations have researchers faced when integrating ontologies and knowledge graphs with neural networks?
Expected Tasks:
Conduct a systematic literature review to identify key papers and works in the domain.
Categorize the methods based on the specific application (e.g., weight initialization, network regularization, interpretability).
Analyze the reported advantages and challenges for each method or approach.
Summarize the domains and neural network architectures that have seen significant ontology and knowledge graph integration.
Discuss potential future directions.
Prior-Knowledge and Skills:
Comprehensive understanding of neural networks and their architectures.
Familiarity with ontology structures, knowledge graph representations, and their applications.
Analytical and critical reading skills to discern the quality and relevance of research works.
References:
[1] Sheth, Amit, et al. "Shades of knowledge-infused learning for enhancing deep learning." IEEE Internet Computing 23.6 (2019): 54-63.
[2] Tiddi, Ilaria, and Stefan Schlobach. "Knowledge graphs as tools for explainable machine learning: A survey." Artificial Intelligence 302 (2022): 103627.
[3] Gaur, Manas, Keyur Faldu, and Amit Sheth. "Semantics of the black-box: Can knowledge graphs help make deep learning systems more interpretable and explainable?." IEEE Internet Computing 25.1 (2021): 51-59.
Keywords: Neural Network Enhancements, Ontologies, Knowledge Graphs, Knowledge Infusion
8. Leveraging Large Language Models to Model Ontologies for Data Augmentation in Tabular Classification Tasks
Supervisors: Majlinda Llugiqi, Marta Sabou
Main idea: This thesis explores the use of Large Language Models (LLMs) to model domain-specific ontologies based on dataset features. These ontologies will then be populated with instances from the same datasets and used to augment the training data for tabular classification tasks. The aim is to evaluate whether LLM-generated ontologies can enhance the performance of machine learning (ML) models by providing structured semantic knowledge in addition to the raw tabular data.
Motivation: Ontologies provide a formal and structured way to represent domain knowledge, which can improve the generalization ability of machine learning models. LLMs, with their ability to understand context and structure, can help automate the creation of such ontologies. By modeling ontologies that capture the relationships between dataset features and augmenting training data with this semantic knowledge, this thesis aims to reduce the limitations of purely tabular data and enhance model performance in classification tasks. This approach is particularly relevant for improving data representation and classification accuracy in domains where relationships between features are not explicit in the raw data.
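A minimal, hedged sketch of the ontology-drafting step, assuming the OpenAI Python client and an API key; the feature list, prompt, and model name are illustrative only:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

features = ["age", "blood_pressure", "cholesterol", "diagnosis"]  # illustrative
prompt = (
    "Model a small OWL ontology in Turtle syntax for a medical dataset "
    f"with the features {features}. Define classes, data properties, and "
    "plausible relationships between the features."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # candidate ontology in Turtle

The returned Turtle would then be validated, populated with dataset rows as instances, and used to derive additional training features.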
Research Questions:
How effectively can Large Language Models generate domain-specific ontologies based on tabular dataset features?
Does augmenting the training data with instances from ontologies generated using LLMs improve the performance of ML models in tabular classification tasks?
How do LLM-generated ontologies compare with manually crafted ontologies in terms of accuracy, consistency, and usefulness for data augmentation?
Expected Tasks:
Conduct a literature review on the use of ontologies in machine learning and the application of LLMs for ontology modeling.
Select tabular datasets and identify key features for ontology modeling.
Use LLMs to model ontologies based on these features and populate them with instances from the datasets.
Augment the training data with the ontological instances and evaluate their impact on the performance of classification models (e.g., decision trees, SVM).
Compare the results with baseline models trained on the original tabular data.
Document the findings and provide recommendations for future research.
Prior-Knowledge and Skills:
Basic knowledge of machine learning algorithms (classification, regression, etc.).
Familiarity with Large Language Models (e.g., GPT) and their applications in natural language processing.
Familiarity with ontology modeling and knowledge graphs (course K2 in the SBWL KM);
Data analysis and evaluation skills.
References:
[1] Llugiqi, Majlinda, Fajar J. Ekaputra, and Marta Sabou. "Enhancing Machine Learning Predictions Through Knowledge Graph Embeddings." International Conference on Neural-Symbolic Learning and Reasoning. Cham: Springer Nature Switzerland, 2024.
[2] Kommineni, Vamsi Krishna, Birgitta König-Ries, and Sheeba Samuel. "From human experts to machines: An LLM supported approach to ontology and knowledge graph construction." arXiv preprint arXiv:2403.08345 (2024).
Keywords: Large Language Models, Ontology Modeling, Data Augmentation, Tabular Classification, Machine Learning
9. Identifying Key Concepts in Student-authored ontologies
Supervisor: Marta Sabou
Keywords: semantic web, ontologies, natural categories, key concepts, cognition
Context: Ontologies are machine-actionable domain descriptions that underpin a variety of intelligent applications by capturing the most important concepts in a subject domain and their relationships. Naturally, however, there are various ways to define such ontologies, and ontologies built to cover the same domain often differ (significantly) from one another.
An interesting research question concerns the overlaps between ontologies covering the same domain and built by different ontology engineers. Understanding such commonalities could shed light on shared cognitive structures and processes among ontology engineers. In particular, our hypothesis is that ontologies will share several concepts corresponding to natural categories as defined in cognitive science [1], also coined as key concepts in the ontology engineering community [2]. Yet, to the best of our knowledge, this question has not been investigated so far, potentially due to the lack of collections of ontologies that define the same domain.
To overcome this lack of example ontologies, in our team we have been collecting ontologies created by junior ontology engineers (i.e., students) during ontology engineering courses. This ontology corpus now enables investigating commonalities between ontologies describing the same domain.
Problem: There is currently limited understanding of whether and to what extent junior ontology engineers make use of key concepts while creating their ontologies.
Goal/expected results of the thesis.
The thesis is expected to provide insights into ontology modeling behavior within student populations. We are particularly interested in whether these ontologies overlap in terms of "key concepts". The results of the study will help us understand cognitive processes during ontology modeling, which at a later stage will be compared to ontology modeling with non-human agents, i.e., LLMs.
Research Question: To what extent do student-authored ontologies overlap in terms of the key concepts that they make use of? Do results vary based on the domain?
Methodology:
Collect a set of student-authored ontologies from the existing collection of such ontologies (this will be provided by the supervisor)
Identify top key concepts in the selected ontologies. This step can be performed either:
Automatically: by re-implementing the methods that identify key concepts described in [2]; or
Manually: as a fallback option for students with limited programming skills
Identify and analyze overlaps in key concepts to verify the thesis hypothesis (a minimal sketch of this overlap computation follows below)
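A minimal sketch of the overlap computation in Python with rdflib; the file names are placeholders, and the Jaccard overlap over class names is only a naive proxy for the key-concept measures of [2]:

from rdflib import Graph, BNode
from rdflib.namespace import RDF, OWL

def class_names(path):
    # Extract the local names of all named OWL classes in an ontology file.
    g = Graph().parse(path)
    return {str(c).split("#")[-1].split("/")[-1].lower()
            for c in g.subjects(RDF.type, OWL.Class)
            if not isinstance(c, BNode)}

# Hypothetical student ontologies covering the same domain.
a = class_names("student_a.ttl")
b = class_names("student_b.ttl")

# Jaccard overlap as a first, naive measure; the approach of [2] combines
# cognitive, statistical, and topological criteria and goes beyond this.
print("shared concepts:", sorted(a & b))
print("Jaccard:", len(a & b) / len(a | b))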
Required Skills:
Good understanding of ontologies (completed K2 of SBWL Knowledge Management is a must!)
Sufficient programming skills for using ontology processing libraries
References:
[1] Rosch, E. Principles of Categorization, Cognition and Categorization, Lawrence Erlbaum, Hillsdale, New Jersey, 1978.
[2] Silvio Peroni, Enrico Motta, and Mathieu D'Aquin. 2008. Identifying Key Concepts in an Ontology, through the Integration of Cognitive Principles with Statistical and Topological Measures. In Proceedings of the 3rd Asian Semantic Web Conference on The Semantic Web (ASWC '08). Springer-Verlag, Berlin, Heidelberg, 242–256. https://doi.org/10.1007/978-3-540-89704-0_17
10. LLM-based verification of ontology restrictions
Supervisors: Stefani Tsaneva, Marta Sabou
Keywords: semantic web, ontology evaluation, large language models
Context: The knowledge corpus of AI systems typically relies on conceptual domain knowledge structures such as ontologies, i.e., machine-actionable data structures representing a domain of interest. Low-quality ontologies that include incorrectly represented information, or controversial concepts modeled from only a single viewpoint, can lead to invalid or biased system outputs, thus negatively impacting the trustworthiness of the enabled AI system.
To avoid such cases, intense work has been performed in the last decades in the area of ontology evaluation leading to a variety of automatic techniques (e.g., for the detection of syntax errors, hierarchy cycles, logical inconsistencies) as well as the realization that several quality aspects (e.g., unintended use of modeling elements, incorrect domain knowledge, viewpoints) can only be tested by involving a human-in-the-loop (HiL).
One particular example is the verification of ontology restrictions defined with universal and existential quantifiers. The use of these quantifiers is not trivial and often leads to ontology defects. Currently, such defects can only be detected and repaired by involving a human curator. Although HiL approaches achieve high accuracy for this task, they are typically time-consuming and resource-intensive.
Recently, there have been impressive advancements in AI-powered chatbots, including ChatGPT, which has demonstrated remarkable abilities in language processing and response generation. Thus, the question arises of whether ChatGPT can support ontology verification tasks.
Problem: There is currently limited experimental investigation of how large language models, such as ChatGPT, can support the verification of ontology restrictions.
Goal/expected results of the thesis.
The thesis is expected to provide insights into the effectiveness of ChatGPT in ontology restriction verification. The results of the study will help us understand the advantages and limitations of ChatGPT compared to a traditional HiL approach.
Research Question: How effective is ChatGPT in verifying ontology restrictions when provided with enough instructions and context? Does the performance vary based on the modeled domain?
Methodology:
Experiment A: Replication of a previous LLM-based investigation of ontology restrictions [2,3] (a minimal prompting sketch follows this list)
Collection of additional ontology axioms
Experiment B: Differentiated replication of the first experiment with the new dataset
Comparison between the results obtained in Experiment A and B
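A minimal, hedged prompting sketch, assuming the OpenAI Python client; the axiom, intended meaning, and model name are illustrative, and the actual setup should replicate [2]:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Illustrative axiom in the spirit of the pizza examples of [1].
axiom = "Pizza subClassOf hasTopping only CheeseTopping"
intended = "Every pizza has at least one cheese topping."

prompt = (
    "You are verifying OWL restrictions. Given the axiom "
    f"'{axiom}' and the intended meaning '{intended}', state whether the "
    "axiom matches the intention, and if not, explain the defect "
    "(e.g. universal vs. existential quantifier)."
)

answer = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)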
Required Skills:
Good understanding of ontologies, especially ontology restrictions (completed K2 of SBWL Knowledge Management is a must!)
Some basic understanding of how large language models work
References:
[1] Rector, A. et al. (2004). OWL Pizzas: Practical Experience of Teaching OWL-DL: Common Errors & Common Patterns. In: Motta, E., Shadbolt, N.R., Stutt, A., Gibbins, N. (eds) Engineering Knowledge in the Age of the Semantic Web. EKAW 2004. Lecture Notes in Computer Science(), vol 3257. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30202-5_5
[2] S. Tsaneva, S. Vasic, and M. Sabou, “LLM-driven Ontology Evaluation: Verifying Ontology Restrictions with ChatGPT,” in The Semantic Web: ESWC Satellite Events, 2024, 2024. https://dqmlkg.github.io/assets/paper_1.pdf
[3] S. Vasic, "ChatGPT vs Human-in-the-loop: An approach towards automated verification of ontology restrictions", Bachelor Thesis, Vienna University of Economics and Business, 2023. https://drive.google.com/file/d/1mvKmTS3dcOe_nbZzn5FP1EDAaH6UgM8X/view
[4] B. P. Allen, P. T. Groth, Evaluating class membership relations in knowledge graphs using large language models, in: The Semantic Web: ESWC Satellite Events, 2024. https://2024.eswc-conferences.org/wp-content/uploads/2024/05/77770011.pdf
[5] N. Fathallah, A. Das, S. De Giorgis, A. Poltronieri, P. Haase, L. Kovriguina, NeOn-GPT: A large language model-powered pipeline for ontology learning, in: The Semantic Web: ESWC Satellite Events, 2024. https://2024.eswc-conferences.org/wp-content/uploads/2024/05/77770034.pdf
[6] C.-H. Chiang, H.-y. Lee, Can large language models be an alternative to human evaluations?, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023. DOI 10.18653/v1/2023.acl-long.870
11. LLM usage when learning Ontology Engineering
Supervisors: Stefani Tsaneva, Marta Sabou
Keywords: semantic web, ontology evaluation, large language models
Main idea: The thesis aims to collect qualitative insights into how beginners in ontology engineering make use of applications supported by large language models.
Background: Ontologies conceptualise real-world knowledge and act as a foundational component in many advanced intelligent applications (e.g., search, decision support) harnessing human knowledge. Ontology engineering, the process of developing ontologies, is a time-intensive task comprising several activities that can potentially be computationally supported [2].
Meanwhile, large language models (LLMs) have shown performance similar to humans on a number of natural language tasks that typically require commonsense or domain knowledge. With recent advances in LLMs and their application to a broad range of tasks, an interest in the synergy between LLMs and ontology engineering has emerged [1].
To better understand the extent to which LLMs can currently support the ontology engineering process, the thesis will focus on collecting information about how students learning to build ontologies make use of LLM-based tools when developing their semantic artefacts.
Research Question: How do novice ontology engineers make use of LLM-supported tools when performing ontology engineering tasks?
Methodology: Literature review + Semi-structured interviews/Focus group
Expected Tasks:
Read literature on collaborative ontology engineering tools supported by LLMs
Conduct interviews/a focus group with students who completed K2 in the Knowledge Management SBWL on their experience in using ChatGPT and other LLM-based tools
Summarize the findings and identify trends (e.g. which tools are used for which tasks)
Skills:
Understanding of ontologies and semantic web technologies (course K2 in the SBWL KM);
Active listening skills and conversational skills
References:
[1] Fabian Neuhaus: Ontologies in the era of large language models – a perspective. Applied Ontology, vol. 18, no. 4, 2023, pp. 399–407. DOI 10.3233/AO-230072
[2] Zhang, B., Carriero, V.A., Schreiberhuber, K., Tsaneva, S., González, L.S., Kim, J., & de Berardinis, J. (2024). OntoChat: a Framework for Conversational Ontology Engineering using Language Models. DOI 10.48550/arXiv.2403.05921
[3] N. Fathallah, A. Das, S. De Giorgis, A. Poltronieri, P. Haase, L. Kovriguina, NeOn-GPT: A large language model-powered pipeline for ontology learning, in: The Semantic Web: ESWC Satellite Events, 2024. https://2024.eswc-conferences.org/wp-content/uploads/2024/05/77770034.pdf
12. LLMs for user assistance in gathering domain knowledge
Supervisors: Katrin Schreiberhuber, Marta Sabou
Keywords: semantic web, cyber physical systems, LLMs, user assistance, explainability
Context: Cyber Physical Systems integrate computational and physical components. They have emerged as promising solutions to solve complex problems and efficiently control systems in various domains, such as smart grids, smart buildings or manufacturing. These systems leverage advanced technologies such as IoT sensors, data analytics, and automation to efficiently manage resources. The concept of Explainable Cyber-Physical Systems aims to provide clear explanations for system decisions and actions, which is crucial for understanding, managing, and controlling high-risk Cyber-Physical Systems. However, these concepts are highly dependent on domain-expert knowledge, which is hard to elicit and conceptualise from expert workshops.
Recently, there have been impressive advancements in AI-powered chatbots, including ChatGPT, which has demonstrated remarkable abilities in language processing and response generation. Thus, the question arises of whether ChatGPT can support the task of gathering domain knowledge from experts.
Problem: Currently, domain knowledge elicitation is a time-consuming, manual and collaboration-intensive task, which requires time and resources from multiple stakeholders for explainable cyber physical systems.
Goal/expected results of the thesis.
In this thesis, the goal is to investigate the potential of LLMs to assist domain experts at different stages of creating an explainable cyber-physical system. Based on an existing user guideline, the thesis should identify multiple options to use the power of LLMs to help experts and users in their tasks.
Research Question: To what extent can LLMs assist domain experts in conceptualizing their knowledge for the creation of Explainable Cyber Physical systems? Which knowledge elicitation tasks benefit the most from the use of LLMs in the process of setting up an explainable cyber-physical system?
Methodology:
Literature Review: Investigate existing research on explainable systems, domain knowledge elicitation, and LLMs for user assistance
Prototyping and Evaluation: Establish and evaluate various setups to use LLMs as an assistant for domain experts in setting up an Explainable Cyber Physical System.
Required Skills:
Literature review skills, including the ability to critically analyse and synthesize existing research.
LLM prompting strategies
References:
Jha, S. S., Mayer, S., & García, K. (2021, November). Poster: Towards explaining the effects of contextual influences on cyber-physical systems. In Proceedings of the 11th International Conference on the Internet of Things (pp. 203-206).
Chari, Shruthi, et al. "Directions for explainable knowledge-enabled systems." Knowledge graphs for explainable artificial intelligence: Foundations, applications and challenges. IOS Press, 2020. 245-261.
13. Leveraging Large Language Models for Information Extraction and Knowledge Graph Generation from Disaster Accident Reports: Implications for Transport Analysis
Supervisors: Shahrom Sohi, Hannah Schuster, Amin Anjomshoaa
This thesis combines LLM technologies with practical applications in the transport sector, making it both academically challenging and industry-relevant. By systematically extracting and verifying information from accident reports (train accidents, airport accidents, floods, fires, etc.) and then translating that information into knowledge graphs for analysis, students will gain hands-on experience in:
Natural Language Processing
Data Verification Techniques
Knowledge Graph Construction
Data Analysis and Visualization
Critical Thinking in Digital Economics
Transportation Analysis
Background
Accident reports are key resources for understanding the causes of accidents and implementing preventive measures. In the railway domain, these data are publicly available in each EU member state (Directive (EU) 2016/798 of the European Parliament and of the Council of 11 May 2016 on Railway Safety (Recast) (Text with EEA Relevance) 2016; see https://www.era.europa.eu/agency/stakeholder-relations/national-investigation-bodies/nib-network-european-network-rail-accidents-national-investigation-bodies_en), but they come as unstructured reports, making systematic analysis challenging and time-consuming. The advent of Large Language Models (LLMs) like GPT-3 and GPT-4 has significantly advanced natural language processing, enabling more effective extraction of information from unstructured texts (Brown et al. 2020). Nonetheless, ensuring the accuracy and reliability of information extracted by LLMs, especially in specialized domains like transport safety and disaster management, remains a critical concern (Eloundou et al. 2023; Liu et al. 2021; Mensa et al., n.d.). Knowledge Graphs (KGs) have emerged as powerful tools for representing complex relationships within data, facilitating advanced analytics and insights across various fields (Hogan et al. 2022). By integrating LLM-based information extraction with KG generation, there is a promising opportunity to enhance transport analysis, leading to improved safety outcomes and operational efficiencies.
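To illustrate the knowledge-graph side with a minimal, hedged sketch: once the LLM has returned structured fields for a report, they can be turned into RDF with rdflib (the namespace, field names, and values below are illustrative, not from a real report):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/accident#")  # illustrative namespace
g = Graph()
g.bind("ex", EX)

# One record as an LLM might return it after extraction from a report.
record = {"id": "acc_001", "type": "TrainAccident",
          "location": "Vienna", "date": "2024-09-15", "cause": "flooding"}

acc = EX[record["id"]]
g.add((acc, RDF.type, EX[record["type"]]))
g.add((acc, EX.location, Literal(record["location"])))
g.add((acc, EX.date, Literal(record["date"])))
g.add((acc, EX.cause, Literal(record["cause"])))

print(g.serialize(format="turtle"))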
References
Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. ‘Language Models Are Few-Shot Learners’. arXiv. doi.org/10.48550/arXiv.2005.14165.
Directive (EU) 2016/798 of the European Parliament and of the Council of 11 May 2016 on Railway Safety (Recast) (Text with EEA Relevance). 2016. OJ L. Vol. 138. data.europa.eu/eli/dir/2016/798/oj/eng.
Eloundou, Tyna, Sam Manning, Pamela Mishkin, and Daniel Rock. 2023. ‘GPTs Are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models’. arXiv. doi.org/10.48550/arXiv.2303.10130.
Hogan, Aidan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio Gutierrez, José Emilio Labra Gayo, et al. 2022. ‘Knowledge Graphs’. ACM Computing Surveys 54 (4): 1–37. doi.org/10.1145/3447772.
Liu, Jintao, Felix Schmid, Keping Li, and Wei Zheng. 2021. ‘A Knowledge Graph-Based Approach for Exploring Railway Operational Accidents’. Reliability Engineering & System Safety 207 (March):107352. doi.org/10.1016/j.ress.2020.107352.
Mensa, Enrico, Daniele Liberatore, Davide Colla, Matteo Delsanto, Marco Giustini, and Daniele P Radicioni. n.d. ‘Road Accidents: Information Extraction from Clinical Reports’.
14. Investigating the Connection Between Train Accidents and Weather Warnings: A Data-Driven Approach
Supervisors: Hannah Schuster, Shahrom Sohi, Amin Anjomshoaa
Background and Motivation:
Severe weather events are occurring more frequently and are projected to increase in both intensity and frequency due to climate change. This trend has considerable implications for infrastructure, particularly transportation networks, including train services. Recent events, such as the severe rainfall event in Austria from September 12–16, 2024, where five times the typical monthly rainfall occurred within five days, highlight the potential for widespread disruption. These floods caused damage to train services, with some lines remaining non-operational weeks after the event.
Given the increasing number of extreme weather events, it is important to investigate how such conditions impact train operations. This research will focus on identifying a potential correlation between weather warnings and train accidents, using accident reports provided by a train service provider.
Research Question:
To what extent are train accidents linked to weather warnings, and how can the alignment of these datasets reveal meaningful correlations between severe weather events and train accidents?
Objectives:
Data Alignment:
Develop a methodology to align the accident reports dataset from the train service provider with the weather warnings dataset in a meaningful way. This will involve determining how to match weather warnings (time, location, type of weather event) with accident reports (date, time, location, cause); a minimal alignment sketch follows the objectives below.
Data Analysis:
Analyze accident reports to identify accidents that occurred during or shortly after weather warnings were issued. This involves evaluating different types of weather warnings (e.g., rainfall, storms, snow) and how frequently they coincide with train accidents.
Correlation Study:
Investigate the statistical correlation between the occurrence of weather warnings and the frequency or severity of train accidents. This could include identifying whether specific weather events are associated with an increase in certain types of accidents or delays.
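One way to make the alignment concrete is sketched below with pandas, assuming both datasets carry timestamps and a shared region key; all file and column names are assumptions:

import pandas as pd

accidents = pd.read_csv("accidents.csv", parse_dates=["timestamp"])
warnings = pd.read_csv("warnings.csv", parse_dates=["issued_at"])

# merge_asof requires both frames to be sorted on the time keys.
accidents = accidents.sort_values("timestamp")
warnings = warnings.sort_values("issued_at")

# Attach to each accident the most recent warning for the same region
# that was issued within the preceding 24 hours.
aligned = pd.merge_asof(
    accidents, warnings,
    left_on="timestamp", right_on="issued_at",
    by="region", direction="backward",
    tolerance=pd.Timedelta("24h"),
)
print(aligned[["timestamp", "region", "warning_type"]].head())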
15. Advanced visualizations for process models
Supervisor: Maxim Vidgof
Background:
Process models are an essential way to communicate normative and descriptive business process specifications. While several open-source visualization tools exist, they lack some advanced visualization functionality and are not directly compatible with each other.
Research problem:
In this thesis, the student will develop a unified library for customized process model visualizations. The library should allow visualizing different types of process models, including Directly-Follows Graphs, BPMN models and Workflow nets.
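To make the scope concrete, the simplest case, a Directly-Follows Graph, can be rendered in a few lines with pandas and graphviz; the sketch below assumes common event-log column names and is only a starting point alongside libraries such as PM4Py [1] and bpmn-js [2]:

import pandas as pd
from graphviz import Digraph

log = pd.read_csv("event_log.csv")  # assumed columns: case_id, activity, timestamp
log = log.sort_values(["case_id", "timestamp"])

# Count directly-follows pairs within each case.
pairs = (log.assign(next_activity=log.groupby("case_id")["activity"].shift(-1))
            .dropna(subset=["next_activity"])
            .groupby(["activity", "next_activity"]).size())

dot = Digraph("dfg", node_attr={"shape": "box"})
for (a, b), count in pairs.items():
    dot.edge(a, b, label=str(count))  # edge frequency as label
dot.render("dfg", format="svg")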
Prerequisites:
Python and JavaScript knowledge are strict requirements, basic knowledge of Process Mining is a plus.
References:
[1] Berti, A., van Zelst, S., & Schuster, D. (2023). PM4Py: A process mining library for Python. Software Impacts, 17, 100556.
[2] bpmn-js: BPMN 2.0 viewer and editor. bpmn.io/toolkit/bpmn-js/
16. Comparing discovered process models beyond structural properties
Supervisor: Maxim Vidgof
Background:
Comparing process models is an essential task in process analytics. Several approaches to comparing models exist; however, they tend to focus only on the structural features of the models. Recently, an approach has been proposed to compare process models beyond their structural properties, e.g. by also incorporating organizational or data dimensions. However, this approach seems to be best suited for normative process models.
Research problem:
In this thesis, the task is to evaluate to which extent existing model comparison methods are applicable to models discovered using Process Mining and adjust the methods for such models if necessary.
Prerequisites:
Programming skills (Python or Java) and completion of (or enrollment in) courses “Process Management for Information Systems” and “Formal Foundations of Information Systems” are required, basic knowledge of Process Mining is a plus.
References:
[1] Schützenmeier, N., Jablonski, S., Schönig, S. (2024). Comparing Process Models Beyond Structural Equivalence. In: Almeida, J.P.A., Di Ciccio, C., Kalloniatis, C. (eds) Advanced Information Systems Engineering Workshops. CAiSE 2024. Lecture Notes in Business Information Processing, vol 521. Springer, Cham. doi.org/10.1007/978-3-031-61003-5_25
[2] van der Aalst, W.M.P., de Medeiros, A.K.A., Weijters, A.J.M.M. (2006). Process Equivalence: Comparing Two Process Models Based on Observed Behavior. In: Dustdar, S., Fiadeiro, J.L., Sheth, A.P. (eds) Business Process Management. BPM 2006. Lecture Notes in Computer Science, vol 4102. Springer, Berlin, Heidelberg. doi.org/10.1007/11841760_10
17. Evaluating the impact of reward strategy on Reinforcement Learning-based Predictive Process Monitoring
Supervisor: Maxim Vidgof
Background:
Predictive Process Monitoring (PPM) is a subfield of Process Mining focusing on predicting the future behavior of running process instances. Suffix prediction is a type of PPM aiming to predict the entire sequence of activities to be executed until instance completion. Recently, a powerful approach based on Reinforcement Learning (RL) has been proposed. However, it has some potential for improvement.
Research problem:
In this thesis, the student is expected to adapt the existing suffix prediction technique by changing its reward strategy. While in the existing approach the reward is only computed once the entire suffix has been predicted, it might be beneficial to compute it after each predicted activity. The task is then to evaluate the impact of the reward frequency on the accuracy and time performance of the approach.
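The contrast between the two reward strategies can be stated compactly; the toy sketch below uses a naive per-position match as the reward signal, which is an assumption for illustration and not the reward definition of [1]:

def terminal_reward(predicted, target):
    # Sparse strategy: a single reward once the whole suffix is predicted,
    # here the fraction of positions that match (toy similarity measure).
    hits = sum(p == t for p, t in zip(predicted, target))
    return hits / max(len(predicted), len(target))

def stepwise_rewards(predicted, target):
    # Dense strategy: one reward per predicted activity, so the agent gets
    # feedback immediately instead of waiting for the case to complete.
    return [1.0 if i < len(target) and p == target[i] else 0.0
            for i, p in enumerate(predicted)]

suffix_true = ["check", "approve", "pay", "archive"]
suffix_pred = ["check", "pay", "pay", "archive"]
print(terminal_reward(suffix_pred, suffix_true))   # one number at the end
print(stepwise_rewards(suffix_pred, suffix_true))  # feedback per step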
Prerequisites:
Python and some knowledge of Reinforcement Learning are required, basic knowledge of Process Mining is a plus.
References:
[1] Rama-Maneiro, E., Patrizi, F., Vidal, J., Lama, M. (2024). Towards Learning the Optimal Sampling Strategy for Suffix Prediction in Predictive Monitoring. In: Guizzardi, G., Santoro, F., Mouratidis, H., Soffer, P. (eds) Advanced Information Systems Engineering. CAiSE 2024. Lecture Notes in Computer Science, vol 14663. Springer, Cham. doi.org/10.1007/978-3-031-61057-8_13
18. Representation through Deliberation: The Vienna Klima-Teams as a Case-Study of Deliberative Democracy
Supervisor: Jan Maly
CAUTION: This thesis will need to be finished by the end of February, as the supervisor will be on parental leave next semester.
Context: Citizen assemblies offer an alternative way of making political decisions by bringing together a broadly representative selection of citizens, chosen by lottery, who deliberate on a specific topic. One key promise of citizen assemblies is that the diversity of perspectives in the population is represented in the composition of the assembly [1]. However, do citizen assemblies indeed lead to more representative outcomes? This question is hard to answer empirically for most citizen assemblies, as we lack a way to measure the representativeness of their outcomes. However, the Vienna Klimateams [2], which combine participatory budgeting with citizen assemblies in a novel way, offer a perfect test case for gaining insights into the representativeness of deliberative democracy, by measuring the representativeness of the outcomes using well-established metrics of fairness developed for participatory budgeting [3].
Problem: There is currently no empirical evidence that citizen assemblies lead to representative outcomes.
Goal/expected results of the thesis.
In this thesis, we will analyze whether the outcome of the Klimateam citizen assemblies was representative of the whole district by comparing the location of the selected projects with population density data and other socio-economic measures.
Research Question: Did the Vienna Klimateam citizen assemblies produce results that adequately represent the diversity of the districts?
Methodology:
Translate the publicly available data on the outcomes of the Vienna Klima-Teams into a machine-readable data set that also contains location information.
Search for the fitting socio-economic data on Vienna in open data repositories.
Use data analytic methods and numerical fairness measures from the participatory budgeting literature to analyze these data sets (a minimal sketch of one such measure follows below).
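As a minimal sketch of one such measure, the Gini coefficient can be computed over per-area funding shares; the numbers below are made up:

import numpy as np

def gini(x):
    # Gini coefficient via the standard rank-weighted closed form
    # over values sorted in ascending order.
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    return (2 * np.arange(1, n + 1) - n - 1) @ x / (n * x.sum())

# Made-up funding amounts received by sub-areas of a district.
funding = [120_000, 40_000, 0, 80_000, 10_000]
print(f"Gini = {gini(funding):.3f}")  # 0 = perfectly equal, 1 = maximally unequal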
Required Skills:
Good understanding of data analysis, ideally with Python.
Willingness to learn about mathematical measures of fairness, like the Gini-Coefficient.
Initial reading list:
[1] Hélène Landemore; Deliberative Democracy as Open, Not (Just) Representative Democracy. Daedalus 2017; 146 (3): 51–63
[2] klimateam.wien.gv.at
[3] The (Computational) Social Choice Take on Indivisible Participatory Budgeting, Simon Rey and Jan Maly, 2023
19. Structured Literature Review Methodologies: What Makes a Survey Paper in the Age of AI?
Supervisors: Daniil Dobriy, Axel Polleres
Background:
The advent of generative AI and large language models has significantly transformed many aspects of the research ecosystem, including the process of conducting literature reviews. Traditionally, structured literature reviews (SLRs) have been a cornerstone of academic research, providing comprehensive overviews of existing knowledge in a particular field. They serve as critical tools for identifying research gaps, synthesizing findings, and guiding future research directions.
However, the landscape of literature reviews is rapidly evolving. AI-powered services like Perplexity and others are emerging with the promise of automating parts of the literature review process. These tools leverage natural language processing, machine learning, and vast databases to quickly scan, summarize, and even analyze large volumes of academic literature. This shift presents both opportunities and challenges for researchers.
On one hand, AI tools have the potential to significantly reduce the time and effort required for certain aspects of literature reviews, such as initial searching and filtering of relevant papers. They may also help in identifying patterns or connections that human researchers might overlook. On the other hand, there are concerns about the depth of understanding, critical analysis, and nuanced interpretation that AI can provide compared to human researchers.
In this context, it becomes crucial to re-examine the methodologies of structured literature reviews. We need to understand which aspects of SLRs are most suitable for automation, which require human expertise, and how these can be integrated to create more efficient and effective review processes.
Research problem:
The primary research problem for this thesis can be formulated as follows:
How can the methodologies of structured literature reviews be adapted and optimized in the age of AI to leverage automation while maintaining quality and depth of analysis?
Three initial references:
Bolanos, F., Salatino, A., Osborne, F., & Motta, E. (2024). Artificial intelligence for literature reviews: Opportunities and challenges. arXiv preprint arXiv:2402.08565.
Webster, J., & Watson, R. T. (2002). Analyzing the past to prepare for the future: Writing a literature review. MIS Quarterly, xiii-xxiii.
Watson, R. T., & Webster, J. (2020). Analysing the past to prepare for the future: Writing a literature review a roadmap for release 2.0. Journal of Decision Systems, 29(3), 129-147.
It is ironic to note that the paper [Watson & Webster 2020] seems to be wholly oblivious to the existence of Semantic Web Standards, which support such academic Knowledge Graphs as the Open Research Knowledge Graph (ORKG) and scientific Wikidata. This oversight is particularly striking given that the paper itself is proposing methods for automating literature reviews using graph-based approaches.
Keywords:
Structured Literature Review, Generative AI, Academic Knowledge Graphs, Semantic Web, Research Ecosystem
Prior Knowledge & Requirements:
Foundational understanding of structured literature reviews (SLRs)
Wish to aim for an academic publication based on the results of the thesis
If you choose this topic, you should be good with setting and keeping deadlines :)
20. Wikibase Cloud: Collecting and Integrating Knowledge Graphs from Wikibase Instances
Supervisors: Daniil Dobriy, Axel Polleres
Background:
Linked Open Data (LOD) has become a cornerstone of the Semantic Web, with Wikibase emerging as a significant platform for creating and managing Knowledge Graphs (KGs), especially among governmental agencies and GLAM institutions. As an offshoot of the software powering Wikidata, Wikibase has seen growing adoption, forming its own ecosystem within the broader LOD landscape.
The Wikibase ecosystem is expanding rapidly, becoming an increasingly important source of LOD, potentially rivaling traditional data providers in terms of volume and diversity of information. As the number of Wikibase instances grows, so does the interlinking between different KGs.
Extracting structured data from Wikibase instances presents both opportunities and challenges. Unlike Semantic MediaWiki, Wikibase offers a more standardized data model and API, facilitating easier data extraction. However, the specifics of the Wikidata data model necessitate robust extraction and integration methods.
Additionally, analyzing the ontology links, complexity, interlinking patterns, and temporal evolution of Wikibase instances can provide valuable insights into the growth and development of this ecosystem, potentially revealing trends in collaborative knowledge creation and the evolution of domain-specific Knowledge Graphs and Enterprise Knowledge Graphs.
Research problem:
The primary research problem for this thesis is to develop and implement a methodology for systematically collecting, integrating, and analyzing data from diverse Wikibase instances to create a comprehensive "Wikibase Cloud" corpus. This involves addressing challenges in data extraction and metadata aggregation across multiple Wikibase instances. Additionally, the study seeks to compare the Wikibase Cloud with other LOD corpora, such as the LOD Cloud and SMW Cloud, to better understand the unique attributes of Wikibase-based Knowledge Graphs in the broader context of Linked Data.
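As a starting point, every Wikibase instance exposes the same MediaWiki action API, so collection can be built on a uniform access layer. The following hedged sketch fetches one entity from Wikidata, which stands in here for any instance; the endpoint and item are examples only:

import requests

# Each Wikibase instance exposes this action API under its own /w/api.php URL;
# Wikidata is used here only as a stand-in example.
API = "https://www.wikidata.org/w/api.php"

resp = requests.get(API, params={
    "action": "wbgetentities",
    "ids": "Q42",          # example item
    "format": "json",
}, timeout=30)
entity = resp.json()["entities"]["Q42"]

print(entity["labels"]["en"]["value"])              # entity label
print(len(entity.get("claims", {})), "properties with statements")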
Three initial references:
Dobriy, D., Beno, M., & Polleres, A. (2024, May). SMW Cloud: A Corpus of Domain-Specific Knowledge Graphs from Semantic MediaWikis. In European Semantic Web Conference (pp. 145-161). Cham: Springer Nature Switzerland.
Haller, A., Polleres, A., Dobriy, D., Ferranti, N., & Rodríguez Méndez, S. J. (2022, May). An analysis of links in Wikidata. In European Semantic Web Conference (pp. 21-38). Cham: Springer International Publishing.
Polleres, A., Pernisch, R., Bonifati, A., Dell'Aglio, D., Dobriy, D., Dumbrava, S., ... & Wachs, J. (2023). How does knowledge evolve in open knowledge graphs?. Transactions on Graph Data and Knowledge, 1(1), 11-1.
Keywords:
Wikibase, Wikidata, Linked Open Data, Knowledge Graph, Ontology Links, Knowledge Graph Evolution
Prior Knowledge & Requirements:
Knowledge of what Wikidata is and some experience working with it
Wish to aim for an academic publication based on the results of the thesis
If you choose this topic, you should be good with setting and keeping deadlines :)
21. Benchmarking LLMs for Structured Data Extraction from Scientific Articles
Supervisors: Daniil Dobriy, Axel Polleres
Background:
The advent of Large Language Models (LLMs) and generative AI has significantly transformed the landscape of information extraction. These advanced AI models have shown remarkable capabilities in "understanding" human-like text, opening up new possibilities for automating various aspects of the research ecosystem.
In the realm of scientific literature, the potential for LLMs to assist in extracting structured data from academic papers is particularly promising. Traditional methods of data extraction from scientific articles often involve manual curation or rule-based systems, which can be time-consuming and may struggle with the complexity and diversity of scientific writing.
LLMs, with their ability to understand context, could potentially overcome many of these limitations, given they are supported by a suitable Retrieval-Augmented Generation approach. They offer the possibility of more adaptable extraction of structured data from the unstructured text of scientific papers. This could greatly accelerate the process of knowledge synthesis and discovery in scientific research, enhancing the population and quality of scientific knowledge graphs such as the Open Research Knowledge Graph (ORKG) [Jaradeh et al., 2019] and scientific Wikidata.
However, the application of LLMs to this task is not without challenges. Scientific papers often contain specialized terminology, complex concepts, and domain-specific knowledge that may be challenging even for advanced AI models. Moreover, the extraction of truly structured data requires not just understanding the text, but also mapping it to appropriate ontologies.
The reliability and accuracy of LLM-based extraction methods also need rigorous evaluation, especially given the high stakes of scientific information. There's a need for benchmarking these models against human expert performance and existing automated methods.
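To make the extraction step concrete, the following is a minimal, hedged sketch of LLM-based structured extraction (without the retrieval component), assuming the OpenAI Python client; the passage, schema keys, and model name are illustrative only:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

passage = ("We fine-tuned BERT-base on 10,000 labeled abstracts and "
           "reached an F1 score of 0.87 on the test split.")  # illustrative

prompt = (
    "Extract the following fields from the passage as JSON with keys "
    "'model', 'training_set_size', 'metric', 'score':\n" + passage
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)
print(json.loads(resp.choices[0].message.content))

In a full pipeline, the extracted JSON would then be mapped to an ontology and compared against expert-curated ground truth for benchmarking.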
Research problem:
How can Large Language Models be effectively leveraged and evaluated for the task of extracting structured data from scientific articles?
Including:
What ontologies are most suitable for structuring knowledge extracted from academic papers across various scientific domains?
How can a Retrieval-Augmented Generation (RAG) pipeline be engineered to effectively extract structured knowledge from scientific articles using LLMs?
How does the performance of LLM-based extraction compare to traditional methods and human expert curation?
Three initial references:
Mihindukulasooriya, N., Tiwari, S., Dobriy, D., Nielsen, F. A., Chhetri, T. R., & Polleres, A. (in press). Scholarly Wikidata: Population and exploration of conference data in Wikidata using LLMs. In Proceedings of the 24th International Conference on Knowledge Engineering and Knowledge Management. https://dobriy.org/papers/scholarly_wikidata.pdf
Dobriy, D. (2024). Employing RAG to create a conference knowledge graph from text. In Proceedings of the 3rd International Workshop on Knowledge Graph Generation from Text (Text2KG). https://ceur-ws.org/Vol-3747/text2kg_paper4.pdf
Jaradeh, M. Y., Oelen, A., Farfar, K. E., Prinz, M., D'Souza, J., Kismihók, G., ... & Auer, S. (2019). Open research knowledge graph: next generation infrastructure for semantic scholarly knowledge. In Proceedings of the 10th international conference on knowledge capture (pp. 243-246).
Keywords:
Retrieval-Augmented Generation, Information Extraction, Scientific Knowledge Graphs, ORKG, Wikidata
Prior Knowledge & Requirements:
Experience prompting LLMs; experience using the API is an advantage
Wish to aim for an academic publication based on the results of the thesis
If you choose this topic, you should be good with setting and keeping deadlines :)
22. Evaluation of the Accuracy in Weather Warning Predictions
Supervisors: Amin Anjomshoaa, Hannah Schuster
Due to the increasing frequency and intensity of natural disasters driven by climate change, weather warnings have become a critical aspect of public safety and disaster preparedness. In this context, timely and accurate weather warnings play a crucial role in alerting populations to imminent dangers, reducing the potential loss of life and minimizing damage to homes, roads, public utilities, and other critical infrastructure. In addition to helping the general public, weather warnings are vital for policymakers and crisis management teams. Decision-makers rely on these forecasts to assess potential risks and allocate resources more effectively.
Weather warnings are typically issued as part of a dynamic, evolving process, where initial alerts are followed by subsequent updates as more information becomes available and the situation develops. This iterative approach ensures that warnings remain accurate and relevant, reflecting the most up-to-date understanding of the natural event's trajectory, intensity, and potential impacts and recommendations.
The goal of this research is to apply machine learning techniques to evaluate the quality and effectiveness of weather warnings in Austria by analyzing weather data series together with the warnings issued by the Austrian meteorological service, GeoSphere Austria [1]. To this end, the developed models should be able to identify patterns, correlations, and discrepancies between the actual weather conditions and the warnings issued.
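As a first building block, warning quality can be quantified with standard forecast-verification measures. Below is a minimal Python sketch that aligns observed values with warning intervals and computes a hit/miss/false-alarm contingency table; the column names and the 70 km/h gust threshold are illustrative assumptions, and real inputs would come from the GeoSphere data hub [1] and the warning archive.

# Minimal verification sketch with toy data (assumed columns and threshold)
import pandas as pd

obs = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=6, freq="h"),
    "gust_kmh": [40, 85, 90, 30, 75, 20],
})
warn = pd.DataFrame({  # one row per issued warning interval
    "start": pd.to_datetime(["2024-01-01 01:00"]),
    "end": pd.to_datetime(["2024-01-01 03:00"]),
})

event = obs["gust_kmh"] >= 70  # "severe" per an assumed threshold
warned = obs["time"].apply(
    lambda t: ((warn["start"] <= t) & (t <= warn["end"])).any()
)

hits = int((event & warned).sum())
misses = int((event & ~warned).sum())
false_alarms = int((~event & warned).sum())
csi = hits / (hits + misses + false_alarms)  # critical success index
print(f"hits={hits} misses={misses} false_alarms={false_alarms} CSI={csi:.2f}")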
References:
[1] https://data.hub.geosphere.at/
[2] Ren, X., Li, X., Ren, K., Song, J., Xu, Z., Deng, K., & Wang, X. (2021). Deep learning-based weather prediction: a survey. Big Data Research, 23, 100178.
[3] Tang, L., Li, J., Du, H., Li, L., Wu, J., & Wang, S. (2022). Big data in forecasting research: a literature review. Big Data Research, 27, 100289.
Keywords: Weather Prediction, Machine Learning, Data Analysis
23. Cookies Database
Supervisor: Amin Anjomshoaa
Web cookies are a fundamental component of modern web programming, playing a crucial role in the functionality of stateful web applications. By storing temporary data, cookies allow websites to maintain user sessions, remember login credentials, track preferences, and personalize the user experience across multiple interactions. However, despite their technical necessity, cookies are often exploited by businesses for data collection purposes. Many companies use cookies to track users’ online behavior, building detailed profiles of individuals to target them with advertisements, tailor content, or analyze user trends. This pervasive data collection, often carried out without users’ full awareness or consent, raises significant privacy concerns.
In response to these privacy issues, new regulations, particularly from the European Union (such as the General Data Protection Regulation, GDPR), have mandated stricter controls on how websites use cookies. One of the key requirements is that website owners must obtain explicit permission from users before storing cookies on their devices, especially for cookies that track personal information or are used for marketing purposes. As a result, users are now frequently confronted with cookie consent banners or popups when visiting websites, asking them to accept or decline different types of cookies. While these regulations are intended to give users more control over their personal data, the practical implementation has been chaotic and, in many cases, ineffective. Many users click through these prompts without fully understanding the implications of their choices or the types of data being collected.
The goal of this research is to address these challenges by developing an automated method for systematically analyzing the cookies used by websites. Specifically, this project aims to create a web crawler capable of scanning websites and extracting detailed information about the cookies they employ. This information would include the types of cookies in use, their specific purposes (e.g., functional, analytical, advertising), the data they collect, and how long they persist on users' devices.
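As a starting point, the Python sketch below shows the shape of such cookie records for cookies set via HTTP headers, using the requests library. Cookies set by JavaScript, which include most tracking cookies, would instead require a headless browser such as Playwright or Selenium; the record fields shown are illustrative.

# Minimal crawler sketch: header-set cookies only (JS-set cookies need a browser)
import datetime
import requests

def cookie_records(url: str) -> list[dict]:
    session = requests.Session()
    session.get(url, timeout=10)
    records = []
    for c in session.cookies:  # an http.cookiejar-style jar of Cookie objects
        records.append({
            "site": url,
            "name": c.name,
            "domain": c.domain,
            "secure": c.secure,
            "expires": (datetime.datetime.fromtimestamp(c.expires)
                        if c.expires else "session"),
        })
    return records

for rec in cookie_records("https://example.org"):
    print(rec)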
References:
[1] Cookies Database, https://cookiedatabase.org/
[2] Dabrowski, A., Merzdovnik, G., Ullrich, J., Sendera, G., & Weippl, E. (2019). Measuring cookies and web privacy in a post-GDPR world. In Passive and Active Measurement: 20th International Conference, PAM 2019, Puerto Varas, Chile, March 27–29, 2019, Proceedings 20 (pp. 258-270). Springer International Publishing.
[3] Bollinger, D. (2021). Analyzing cookies compliance with the GDPR (Master's thesis, ETH Zurich).
Keywords: Web Cookies, Data Extraction, Data Analysis
24. LLM-based Analysis of Scientific Citations
Supervisor: Amin Anjomshoaa
Scientific papers play a central role in the advancement of science by serving as the primary medium for disseminating new findings, theories, and experimental results. Citations, which refer to the practice of referencing previous studies within a new research paper, are a critical component of this scientific process. They reflect how different research works are interconnected, demonstrating how new research progresses from and builds upon past knowledge. By citing earlier studies, researchers acknowledge the contributions of others, showing the cumulative nature of scientific discovery. Citation networks can therefore reveal not only direct influences between studies but also the broader trends and shifts in scientific paradigms over time. Despite the crucial role that citations play in the scientific process, there are instances where citations are not made properly. Some authors may include irrelevant or inappropriate citations in their papers for a variety of reasons, which can undermine the integrity of the scientific literature.
The primary objective of this research is to evaluate the quality of scientific citations within a repository of academic papers available at our institute. To achieve this, Large Language Model (LLM) techniques will be employed to analyze the textual content of both the source papers and the cited works. This analysis will focus on identifying relevance indicators for each individual citation, as well as generating an overall citation quality score for each paper. By leveraging LLM capabilities, we aim to provide a deeper understanding of citation relevance and quality.
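One simple relevance indicator, in the spirit of the sentence-embedding approach of [2], is the cosine similarity between the sentence surrounding a citation and the cited paper's abstract. The Python sketch below illustrates this; the model name and the example texts are placeholders, and a full citation quality score would combine several such indicators.

# Minimal relevance-indicator sketch (assumed: model name, example texts)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

citation_context = ("Prior work has applied transformer models to "
                    "clinical narratives [12].")
cited_abstract = ("We present a transformer-based approach for information "
                  "extraction from clinical notes ...")

emb = model.encode([citation_context, cited_abstract])
relevance = float(util.cos_sim(emb[0], emb[1]))  # similarity in [-1, 1]
print(f"relevance indicator: {relevance:.2f}")   # near 0 would flag the citation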
References:
[1] Ding, Y., Zhang, G., Chambers, T., Song, M., Wang, X., & Zhai, C. (2014). Content-based citation analysis: The next generation of citation analysis. Journal of the Association for Information Science and Technology, 65(9), 1820-1833.
[2] Lagopoulos, A., & Tsoumakas, G. (2021). Self-citation Analysis using Sentence Embeddings. arXiv preprint arXiv:2105.05527.
Keywords: Scientific Papers, LLM, Data Analysis
25. LLM-based Analysis of Scientific Charts
Supervisors: Amin Anjomshoaa, Elmar Kiesling
The charts included in scientific papers serve as a critical tool for conveying complex concepts, ideas, and research findings in a clear and accessible manner. They play a significant role by visually summarizing key results, trends, or relationships within a dataset, allowing readers to quickly grasp the essence of the research. In most cases, charts not only visualize raw data but also provide insights into the relationships between variables, statistical trends, or outcomes of experimental work. Because of this, the interpretation of charts is crucial for understanding the full scope of a scientific paper. The goal of this research is to develop methods for extracting detailed metadata from the charts included in scientific papers. This metadata includes essential information such as the source of the data used to generate the chart, the specific variables that are represented, and the relevant descriptions or explanations of the chart provided in the text. This process will involve using Large Language Model (LLM) techniques for analyzing the text surrounding the charts, identifying the variables and their relationships, and linking the data presented in the charts to the broader context of the paper's narrative.
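The Python sketch below illustrates one possible shape of the chart-analysis step: a chart image and its surrounding paper text are sent to a vision-capable LLM that is prompted to return structured metadata. The model name and the prompt are assumptions for illustration, not a fixed design.

# Minimal chart-metadata sketch (assumed: model name, prompt wording)
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def chart_metadata(image_path: str, surrounding_text: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder for any vision-capable model
        messages=[{"role": "user", "content": [
            {"type": "text", "text":
                "List, as JSON, the variables shown in this chart, their "
                "units, and the data source, using the paper text below.\n\n"
                + surrounding_text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content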
References:
[1] Mukhopadhyay, S., Qidwai, A., Garimella, A., Ramu, P., Gupta, V., & Roth, D. (2024). Unraveling the Truth: Do LLMs Really Understand Charts? A Deep Dive into Consistency and Robustness. arXiv preprint arXiv:2407.11229.
[2] Li, S., & Tajbakhsh, N. (2023). SciGraphQA: A large-scale synthetic multi-turn question-answering dataset for scientific graphs. arXiv preprint arXiv:2308.03349.
[3] Masry, A., Long, D. X., Tan, J. Q., Joty, S., & Hoque, E. (2022). ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244.
Keywords: Scientific Papers, LLM, Data Extraction
26. A Generic Data Space Architecture for the Construction of Knowledge Graphs
Supervisor: Amin Anjomshoaa
The concept of a Data Space refers to a flexible, integrated framework that allows different types of data from diverse sources to be accessed, managed, and linked without needing to be fully integrated into a single schema or database. It enables data transactions between different data ecosystem parties based on the governance framework of that data space [1]. The Web of Data can therefore be seen as a realization of the dataspaces concept [2] on a global scale, relying on a specific set of web standards. As such it provides an incremental approach to data management, where the degree of data integration can evolve over time as needed, rather than requiring full integration from the outset.
The SOLID (Social Linked Data) project [3], initiated by Tim Berners-Lee, is closely aligned with the concept of Data Spaces, particularly in the context of personal data management and the semantic web. SOLID’s main goal is to give individuals control over their own data, by enabling the decentralized storage and management of personal information.
This research aims to create a decentralized, collaborative data architecture based on SOLID's data space concept, enabling on-demand generation of knowledge graphs by combining data from multiple sources. A Knowledge Graph is a structured representation of information based on Semantic Web and Linked Data concepts where entities (people, organizations, objects, etc.) are nodes, and their relationships are represented as edges. By fostering real-time collaboration between independent data spaces and using semantic technologies, this system would allow for the dynamic creation of structured, meaningful knowledge representations.
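The minimal rdflib sketch below illustrates the idea of on-demand knowledge graph construction: RDF from two independent sources is merged into one graph and queried together. The example data is inline for illustration; in a SOLID setting, each snippet would instead be fetched from a pod URL, and the entity names and vocabulary choices are assumptions.

# Minimal on-demand KG sketch: merge two (hypothetical) sources, then query
from rdflib import Graph

alice = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<https://alice.example/me> foaf:name "Alice" ;
    foaf:currentProject <https://org.example/projects/p1> .
"""
org = """
@prefix dct: <http://purl.org/dc/terms/> .
<https://org.example/projects/p1> dct:title "Project P1" .
"""

g = Graph()                            # the combined knowledge graph
g.parse(data=alice, format="turtle")   # in practice: g.parse(pod_url)
g.parse(data=org, format="turtle")

# entities are nodes, relationships edges; the query spans both sources
for name, title in g.query("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dct:  <http://purl.org/dc/terms/>
    SELECT ?name ?title WHERE {
        ?p foaf:name ?name ; foaf:currentProject ?proj .
        ?proj dct:title ?title .
    }"""):
    print(name, title)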
References:
[1] Dataspaces Support Center, https://dssc.eu/space/Glossary/176554052/2.+Core+Concepts
[2] Halevy, A., Franklin, M., & Maier, D. (2006, June). Principles of dataspace systems. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (pp. 1-9).
[3] SOLID Project, https://solidproject.org/
Keywords: Data Spaces, Linked Data
27. Open ranking algorithms for decentralized social media
Supervisor: Axel Polleres
Social media platforms such as Facebook and Twitter are often criticised for their intransparent information filtering and ranking algorithms, which so far have typically not been made openly available. On the other hand, W3C standards such as ActivityStreams [1] and ActivityPub [2] provide open protocols for building fully open, decentralized social networks. Instantiations of these protocols include recently popularized platforms such as Mastodon [5] or Bluesky [6], which build on open-source protocols that allow anyone to run and operate decentrally connected social media servers, however without the typical recommendation and ranking algorithms of centralised commercial services.
Goals of a thesis in this space could be to
a) review and understand these protocols,
b) review and understand the principles of typical social media ranking algorithms (such as the recently openly published Twitter ranking algorithm [3,4]), and
c) implement a simple, personalised ranking algorithm for Mastodon (see the sketch below),
or a combination of at least two of these topics.
The topic could be worked on by two students, complementing each other's focus.
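For goal c), a minimal personalised ranking could, for example, re-rank the chronological home timeline by engagement and recency. The Python sketch below uses the Mastodon.py client; the scoring weights and the boosted-accounts set are illustrative assumptions, and a real thesis would learn such preferences from the user's interaction history.

# Minimal personalised re-ranking sketch (assumed: weights, boosted accounts)
from datetime import datetime, timezone
from mastodon import Mastodon

api = Mastodon(access_token="...", api_base_url="https://mastodon.social")
favourite_authors = {"alice@example.social"}  # e.g. learned from past favourites

def score(status) -> float:
    age_h = (datetime.now(timezone.utc) - status["created_at"]).total_seconds() / 3600
    engagement = status["favourites_count"] + 2 * status["reblogs_count"]
    boost = 3.0 if status["account"]["acct"] in favourite_authors else 1.0
    return boost * (1 + engagement) / (1 + age_h)  # decay old, reward engagement

for status in sorted(api.timeline_home(), key=score, reverse=True)[:10]:
    print(f"{score(status):6.2f}  @{status['account']['acct']}")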
1. https://www.w3.org/TR/activitystreams-core/
2. https://www.w3.org/TR/activitypub/
3. https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm
4. https://github.com/twitter/the-algorithm
5. https://joinmastodon.org/
6. https://bsky.app/
28. Enriching the CRISP Disaster Prevention Knowledge Graph
Supervisors: Axel Polleres, Amin Anjomshoaa
Data plays a critical role in crisis response and intervention efforts by providing decision-makers with timely, accurate, and actionable information. During a crisis, data can help organizations and crisis managers identify the most affected populations, track the spread of the crisis, and monitor the effectiveness of their response efforts.
The CRISP Knowledge Graph, constructed from various data resources provided by different stakeholders involved in crisis and disaster management, presents a uniform view of infrastructure, networks, and services pertinent to crisis management use cases, accessible for instance via a SPARQL query interface.
In this thesis, you will enrich this Knowledge Graph by linking it to other sources such as Wikidata or further Open Data sources. In the course of the project, you will deepen your knowledge of Linked Data and Knowledge Graphs.
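A typical linking step could look up the label of a CRISP entity on Wikidata to propose candidates for a sameAs link. The Python sketch below queries the public Wikidata SPARQL endpoint via SPARQLWrapper; the example label is illustrative, and fetching labels from the CRISP endpoint itself [1] is left out for brevity.

# Minimal entity-linking sketch (assumed: example label from the CRISP graph)
from SPARQLWrapper import SPARQLWrapper, JSON

wd = SPARQLWrapper("https://query.wikidata.org/sparql",
                   agent="crisp-linking-sketch/0.1")  # polite user agent
wd.setReturnFormat(JSON)

label = "Vienna"  # e.g. the label of an infrastructure entity in CRISP
wd.setQuery(f"""
    SELECT ?item WHERE {{
        ?item rdfs:label "{label}"@en .
    }} LIMIT 5
""")
for b in wd.query().convert()["results"]["bindings"]:
    print(b["item"]["value"])  # candidate Wikidata IRIs for a sameAs link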
1. http://crisp.ai.wu.ac.at/#about
2. A. Anjomshoaa, H. Schuster, J. Wachs, A. Polleres, Towards Crisis Response and Intervention Using Knowledge Graphs – CRISP Case Study, Lecture Notes in Business Information Processing (LNBIP) 482 (2023). https://csh.ac.at/publication/towards-crisis-response-and-intervention-using-knowledge-graphs-crisp-case-study/
3. Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d'Amato, Gerard de Melo, Claudio Gutierrez, José Emilio Labra Gayo, Sabrina Kirrane, Sebastian Neumaier, Axel Polleres, Roberto Navigli, Axel-Cyrille Ngonga Ngomo, Sabbir M. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan Sequeda, Steffen Staab, and Antoine Zimmermann. Knowledge graphs. ACM Computing Surveys (CSUR), 54(4):1-37, July 2021. Extended pre-print available at https://arxiv.org/abs/2003.02320.