A Systematic Literature Review of Student Assessment Framework in Software Engineering Courses

Background: Software engineering courses comprise various project types, including simple assignments completed in supervised settings and more complex tasks undertaken independently by students, without the constant oversight of a teacher or lab assistant. A comprehensive assessment framework is needed to validate the fulfillment of learning objectives and facilitate the measurement of student outcomes, particularly in computer science and software engineering. This need leads to the delineation of an appropriate assessment structure and pattern. Objective: This study aimed to acquire the expertise required for assessing student performance in computer science and software engineering courses. Methods: A comprehensive literature review spanning from 2012 to October 2021 was conducted, resulting in the identification of 20 papers addressing the assessment framework in software engineering and computer science courses. Specific inclusion and exclusion criteria were applied in two rounds of assessment to identify the most pertinent studies for this investigation. Results: The results showed multiple methods for assessing software engineering and computer science courses, including the Assessment Matrix, Automatic Assessment, CDIO, Cooperative Thinking, formative and summative assessment, Game, Generative Learning Robot, NIMSAD, SECAT, Self-assessment and Peer-assessment, SonarQube Tools, WRENCH, and SEP-CyLE. Conclusion: The evaluation framework for software engineering and computer science courses required further refinement, ultimately leading to the selection of the most suitable technique, known as the learning framework.


I. INTRODUCTION
The primary goal of every institution is to ensure high-quality education, and to achieve this objective, each institution develops and offers excellent academic programs [1], [2]. Quality standards were determined by program outcomes, assessments, metrics, and comparisons to meet target benchmarks and the evolving requirements of contemporary society [3]. Moreover, the choice of a school or program significantly influences the lives of individuals, which underscores the importance of informed decisions by parents and students [4], [5].
Programming assignments serve as a common method for computer science instructors to evaluate course objectives [6]-[8]. Computer science courses typically entail various assignments, including smaller tasks, which may be completed in supervised environments such as labs or classrooms, or more substantial assignments requiring students to develop solutions without constant instructor supervision. However, evaluating assignments alone cannot determine when students have successfully mastered the learning objectives [9]-[11]. It is crucial for students to learn

A. The Study Question
The primary objective of this study was to comprehend the student assessment framework for software engineering courses, hence the following questions were formulated.
RQ1: What assessment methods were employed in software engineering courses?
RQ2: What challenges did the assessment framework in software engineering courses encounter?

B. Search Process
Based on the guidelines [34], [35], the search process was initiated by defining the criteria employed. Papers from journals, workshops, and proceedings were obtained from six digital libraries, namely ScienceDirect, Mendeley, IEEEXplore, ACM Digital Library, Springer Link, and Emerald, within the timeframe of January 2012 to October 2021. The search criteria were divided into two parts, C1 and C2, as outlined below:
C1 is a string of keywords related to assessment frameworks, such as assessment and evaluation.
C2 is a string of keywords related to software engineering, computer science, and programming courses.
The following is an example of a search performed in electronic databases: (Assessment Framework OR Evaluation Framework) AND (Software Engineering OR Computer Science OR Programming Course). Inclusion and exclusion criteria, based on Inayat et al. [36], were applied to refine the search results and select papers. The inclusion criteria for the study consisted of (I1) peer-reviewed publications, (I2) publications written in English, (I3) publications related to the search keywords, (I4) publications falling into categories such as papers, experience reports, or workshop papers, and (I5) publications published between January 2012 and October 2021. The exclusion criteria for this study comprised (E1) reviews not specifically addressing assessment in the fields of software engineering or computer science, (E2) investigations lacking discussions on assessment frameworks in software engineering, and (E3) those failing to meet the predefined inclusion criteria.
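The search string and the mechanical inclusion criteria described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Paper` record layout and the function names are assumptions introduced here, not part of the reviewed protocol, and criterion I3 (relevance to the keywords) is left out because it requires human judgment.

```python
from dataclasses import dataclass

# Keyword groups taken from the example search string in the text.
C1 = ["Assessment Framework", "Evaluation Framework"]
C2 = ["Software Engineering", "Computer Science", "Programming Course"]


def build_query(c1, c2):
    """Join each keyword group with OR, then combine the groups with AND."""
    return f"({' OR '.join(c1)}) AND ({' OR '.join(c2)})"


@dataclass
class Paper:
    # Hypothetical record layout; real database exports differ.
    title: str
    year: int
    month: int
    language: str
    peer_reviewed: bool
    kind: str  # e.g. "paper", "experience report", "workshop paper"


def meets_inclusion(p: Paper) -> bool:
    """Check I1, I2, I4, and I5; I3 (keyword relevance) needs a human reader."""
    in_window = (2012, 1) <= (p.year, p.month) <= (2021, 10)
    return (
        p.peer_reviewed
        and p.language == "English"
        and p.kind in {"paper", "experience report", "workshop paper"}
        and in_window
    )


print(build_query(C1, C2))
```

Keeping C1 and C2 as separate lists makes it straightforward to adapt the same string to the query syntax of each digital library.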

C. Study Selection
Searches were conducted in the selected electronic databases using the technique described in Section B to acquire pertinent studies. Table 1 showed the initial search results, which yielded 4,962 studies, although a considerable number of them failed to meet the inclusion criteria. Many results were excluded due to limitations in applying the search string across complete articles. In Round 1, study analysts thoroughly examined study titles and abstracts against the inclusion criteria, resulting in the selection of 78 studies. In Round 2, the exclusion criteria (E1, E2, and E3) were applied to reassess the preselected studies. Face-to-face discussions were held to resolve disagreements raised during evaluations. When consensus could not be reached on an article, the entire paper was read, and it was rejected if it met the established exclusion criteria. After detailed readings of the 78 preselected papers, 58 were discarded because they did not directly address the study questions, leaving 20 primary studies.
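The selection funnel can be checked with simple arithmetic. The sketch below uses only the counts reported in the text (4,962 initial results, 78 preselected studies, 58 discarded); the helper name `retention` is an assumption introduced for illustration.

```python
# Counts reported in the text for each phase of the selection.
initial_results = 4962   # hits returned by the database searches
after_round_1 = 78       # kept after title and abstract screening
discarded_round_2 = 58   # rejected after applying E1, E2, and E3

# Studies that remain after both rounds.
final_studies = after_round_1 - discarded_round_2


def retention(kept, total):
    """Share of `total` that survived a screening phase, as a percentage."""
    return 100.0 * kept / total


print(f"Final primary studies: {final_studies}")
print(f"Round 1 retention: {retention(after_round_1, initial_results):.2f}%")
print(f"Round 2 retention: {retention(final_studies, after_round_1):.2f}%")
```

The result of 20 primary studies agrees with the number of papers reported in the abstract and in the Results section.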

D. Quality Assessment
The quality criteria proposed by Guyatt et al. [38] were utilized in this systematic review to assess the methodological quality of the selected primary studies. These criteria comprised inquiries designed to gauge the completeness, trustworthiness, and significance of the studies, with the aim of evaluating the value of the synthesis findings and facilitating interpretation. An ordinal scale, based on the quality evaluation criteria rather than a binary measure, was employed to categorize and grade each study, following Guyatt et al. [38]. The first criterion (C1) entailed establishing the goal, with an affirmative response rate of 87% in the studies. The second criterion (C2) evaluated the management and documentation of the study environment, also receiving an 87% affirmative response rate. The third criterion (C3) determined whether each study provided a concise summary of its results, with 83% supporting this aspect. It should be noted that the final quality rating (C4) was derived from these three criteria.

E. Data Extraction
In accordance with the guidelines of Kitchenham et al. [35], a data extraction procedure was devised to collect pertinent data from the primary studies included in the investigation. A template was crafted to document the theoretical foundations, methodological innovations, and empirical results, ensuring a higher-level interpretation.

IV. RESULTS
This systematic review aimed to examine frameworks or tools employed by educators for the identification of student competencies within software engineering courses. A total of twenty studies [39]-[58], as previously mentioned, were identified. Among the studies reviewed, 55 percent (11 studies) were presented at conferences, while 45 percent (9 studies) were published in journals. The samples were uniformly distributed across various publication platforms, with each source featuring one or two papers, indicating an absence of preference among authors for any particular source.
It should be noted that no significant studies relating to this topic were published before 2012. The articles under scrutiny spanned diverse fields, covering the period between 2012 and 2021 as shown in Fig. 1. The year 2019 witnessed substantial discourse on assessment frameworks, whereas 2018 showed relatively less emphasis on this subject. The distribution of papers by country is shown in Fig. 2. Papers originated predominantly from the USA, with a presence in Europe, but no representation from South America or Australia. Unfortunately, geographical information pertaining to the authors of each paper remained undisclosed.
A selection of 20 papers was made, comprising three empirical reviews categorized as "empirically appraised," evaluating methods or instruments without experimental investigations and case studies, and 17 "empirically grounded" studies employing qualitative techniques, such as surveys and interviews. The empirically grounded studies were further classified into case studies, surveys, and observations, with 12 papers adopting case studies, 4 utilizing surveys, and 1 employing observation, as seen in Fig. 3. The subsequent section described the 13 assessment frameworks identified and their utilization across the studies.

1) Assessment Matrix
In the study conducted by Zeid [39], the challenge of waning student interest in software engineering courses was addressed through a student competition. Evaluation of student performance in this competition hinged upon an assessment matrix. Similarly, Traverso-Ribon et al. [40] engaged in project-based learning assessment via a web-based technique, also incorporating the use of an assessment matrix.
The utility of an assessment matrix was instrumental in safeguarding the validity of evaluations [59], [60]. Valid assessments accurately gauged the subject matter they were intended to assess, ensuring that they faithfully represented the desired learning outcomes for a course unit. However, it should be noted that reliance on an assessment matrix could introduce subjectivity into the assessment process, contingent upon the judgment of the assessor.
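One common way to realize an assessment matrix is a weighted rubric, in which each criterion carries a weight and the assessor assigns a level per criterion. The sketch below illustrates that general idea only; the criteria names, weights, and the four-level scale are hypothetical and are not taken from the matrices used in [39] or [40].

```python
def matrix_score(weights, ratings, max_level=4):
    """Weighted average of rubric ratings, scaled to the range 0..100.

    weights: criterion name -> relative weight (hypothetical values)
    ratings: criterion name -> level awarded by the assessor (0..max_level)
    """
    if set(weights) != set(ratings):
        raise ValueError("weights and ratings must cover the same criteria")
    total_weight = sum(weights.values())
    weighted = sum(weights[c] * ratings[c] for c in weights)
    return 100.0 * weighted / (total_weight * max_level)


# Hypothetical criteria for a student project.
weights = {"design": 2, "code": 1, "documentation": 1}
ratings = {"design": 4, "code": 2, "documentation": 2}
print(matrix_score(weights, ratings))
```

The ratings remain a judgment call by the assessor, which is where the subjectivity noted above enters; the matrix only makes the weighting explicit and repeatable.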

2) Automatic Assessment
WebWolf, a user-friendly framework designed for automated grading of assignments in introductory web programming courses, appeared as a viable solution. The WebWolf software demonstrated the ability to load web pages, identify and assess components, navigate hyperlinks, and establish expected outcomes. An evaluation of software performance included assignments from three distinct classes, incorporating a set of dispatch tasks deliberately embedded with errors. WebWolf accurately identified errors within the submissions [41]. The automated assessment approach also extended to the evaluation of UML class diagrams [42]. Furthermore, it provided significant benefits in terms of assessment efficiency and objectivity for educators [43]-[46].
Even though automatic assessment had evident advantages, particularly in the domain of multiple-choice and short-answer questions, it also had certain limitations. Case study questions, which are commonly encountered in software engineering courses, were difficult to assess automatically. Specific question types required specialized computations for precise scoring. Consequently, assessing diverse question types demanded distinct methodologies. For example, the evaluation of use case diagram questions [61] necessitated a different technique than the assessment of class diagram questions [62] due to their inherent distinctions. The efficacy of automatic assessment heavily relied on the accuracy of the answer key formulated by the instructor.
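The kind of check described above (loading a page, identifying required components, and verifying hyperlinks against an instructor's expectations) can be illustrated with the Python standard library. This is a minimal sketch of the general idea and not WebWolf's actual API; `PageInspector`, `grade_page`, and the sample requirements are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class PageInspector(HTMLParser):
    """Collect tag names and hyperlink targets from an HTML submission."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def grade_page(html, required_tags, required_links):
    """Compare a submission against the instructor's answer key.

    Returns (score in 0..1, missing tags, missing links).
    """
    inspector = PageInspector()
    inspector.feed(html)
    missing_tags = [t for t in required_tags if t not in inspector.tags]
    missing_links = [h for h in required_links if h not in inspector.links]
    checks = len(required_tags) + len(required_links)
    passed = checks - len(missing_tags) - len(missing_links)
    score = passed / checks if checks else 1.0
    return score, missing_tags, missing_links


submission = '<html><body><h1>Hi</h1><a href="about.html">About</a></body></html>'
print(grade_page(submission, ["h1", "table"], ["about.html"]))
```

As the text notes, a grader of this kind is only as good as its answer key: the lists of required tags and links play that role here, and an incomplete key silently inflates scores.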
3) CDIO (Conceive, Design, Implement, and Operate)
The imperative to elevate the quality of educational offerings within Higher Education Institutions (HEI) necessitated the development of enhanced assessment frameworks in line with contemporary standards. Existing evaluation frameworks were informal and unsuitable for continual quality enhancement. A potential solution arose through the proposal of a novel assessment model and measurement methodology rooted in the CDIO Standard [47].
CDIO had three overarching objectives [63], [64]. First, it aspired to foster a comprehensive grasp of the fundamental technical aspects of software engineering. Second, it aimed to empower individuals to take the lead in conceiving and executing new software solutions. Third, it underscored the importance of comprehending the broader significance and strategic implications of studies and technological advancements within society. CDIO assessments exhibited a high degree of complexity, spanning multiple levels including program philosophy and curriculum development. Implementing this assessment paradigm within a software engineering course necessitated harmonious integration across all faculty members. Additionally, reusing CDIO assessments presented a challenge, requiring continuous updates and incurring additional costs.

4) Cooperative Thinking (CooT)
The concepts of computational thinking (CT) and agile values (AV) focused on the ability of an individual for algorithmic thinking and the basics of collaborative software development. A model for addressing computational problems in teams was known as cooperative thinking (CooT). Cooperative thinking referred to a new skill aimed at promoting collaborative problem-solving of technical information relevant to solving challenging software engineering problems. CooT was recommended for educational purposes to teach students how to create software to enhance their performance both individually and collectively [48].
In contrast to conventional examinations, CooT assessed the ability of students for cooperative thinking in carrying out tasks rather than their assignment performance. CooT in the software engineering course consisted of five evaluation concepts, namely complex negotiation [65], [66], continuous learning [67], group awareness [68]-[70], group organization [71], and social adaptation [72]-[74].

5) Formative and Summative Assessment
This framework was designed for long-term evaluation, hence assessors could evaluate from the beginning until the completion of the learning process. Summative assessments [75], [76] evaluated the current levels of knowledge and the skills of students. The instructional process included a phase called formative assessment [77], which was integrated into classroom teaching to provide the information necessary to adapt instruction and learning. In software engineering classes, traditional educators also used this assessment. Evaluations that had been conducted could be reviewed through well-developed software such as the Instructional Module Development System (IMODS) [49].
An open-source, web-based course design tool called IMODS provided a framework for presenting curricula, specifically in STEM fields. It guided users through the development process, and the evaluation method selected for this course combined formative and summative assessments. The model developed examined the consistency of learning domains, performance, and criteria needed to match the assessments selected for the course with the learning objectives [49].

6) Game
Video game technology, in the context of education (also known as "serious games"), was one of the methods used for assessments. This was because games were often exciting and inspiring, thereby motivating students to study programming structures in a fun and familiar setting, and apply learning from those environments to understand the fundamentals of computer programming through coding. There was a significant need for explicit instruction and analysis of how games could be created specifically to enhance problem-solving skills [50], [51].
Games increased student engagement and provided a setting for accurate and relevant assessment [78]. In addition to previous achievements and traditional assessments, games improved the success levels of students [79]. It should be noted that games were designed to assess the quality of individual understanding. In the software engineering course, students generally enjoyed games because they frequently interacted with electronic devices. Therefore, assessment using this system was an effective tool for software engineering students. However, this framework had its weaknesses, as every game built had to be updated to be in line with the curriculum and the times. Games also required audio/visual technologies to support assessment [80].

7) Generative Learning Robot
E-learning was analyzed through a variety of viewpoints and methodologies. When the field was examined holistically and within the context of the Generative Learning Robot (GLO), a heterogeneous meta-program designed to teach computer science (CS) topics, namely programming, the process-based perspective appeared to be the most relevant. Drawing inspiration from the software engineering field, a feature-based technique was employed to model the inherent unpredictability of CS learning [52].

8) NIMSAD (Normative Information Model-based Systems Analysis and Design)
NIMSAD served as a framework for evaluating methods, industry/business practices, and problem-solving techniques. Although methodologies could be oriented toward provisioning and enhancing operational efficiency, evaluating their suitability for addressing specific "problem situations" could be quite challenging. NIMSAD addressed this by offering a general framework for such evaluation [53].
NIMSAD found utility in problem-solving scenarios within software engineering courses [81]-[83], particularly in case studies focused on customer issues and solutions devised by students. NIMSAD evaluated three key factors, namely the problem solver, the contextual elements of the issue, and the procedure used to solve the problem.

9) SECAT (Software Engineering Competency Assessment Tool)
Evaluation often entailed two dimensions. First, there was an overarching view that considered the outcomes of various strategies, comprising an assessment of all course participants collectively. For example, this included comparing the competencies of students at the beginning of a course to those at the end. Second, there was an evaluation of competence concerning each student, which could be quantified on an absolute scale or in relation to their peers in the same course. Instructors evaluating performance gained valuable insights into the competencies of students by considering assessments from multiple assessors. Therefore, encouraging consistency among raters when using SECAT conventions enhanced its validity and reliability [54].
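Consistency among raters can be quantified with a standard agreement statistic. The sketch below computes Cohen's kappa for two raters; it is offered as one possible way to check rater consistency and is not claimed to be the measure used by SECAT [54]. The competency levels in the example are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same, non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical competency levels given to four students by two assessors.
print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 0]))
print(cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0]))
```

A value near 1 indicates that the assessors apply the conventions consistently, while a value near 0 indicates agreement no better than chance.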

10) Self-Assessment and Peer Assessment
Proficiency in basic programming skills served as a prerequisite for Information Technology (IT) students. Mere memorization of programming concepts was considered insufficient; to gain a thorough understanding, students had to engage in practical tasks personally. Assessment and evaluation were conducted within the Problem-Based Learning (PBL) framework, assessing both the process and product. From a process perspective, it was crucial to consider and assess the soft skills of students, including self-learning and teamwork abilities, developed through study and project work. This perspective comprised two sub-aspects, namely self-evaluation and peer review [55]. Criteria for evaluation entailed learning motivation, self-directed learning capabilities, teamwork skills, and communication abilities. It was worth noting that the subjective element remained a limitation in this assessment.

11) SonarQube Tools
Programming is a hands-on discipline, and assessing the practical computer skills of students, specifically in large cohorts, was challenging. Issues such as potential plagiarism and variations in overall task difficulty often arose. To mitigate this problem, a multidimensional evaluation model for computer science courses was developed, drawing from extensive practice data within an e-learning system. This model comprised three dimensions, namely accuracy (correctness), originality, and quality detection, allowing for an in-depth examination of learning processes. Correctness was fundamental, while originality assessment entailed categorizing student behavior after detecting potential plagiarism in assignments. Additionally, teachers had the flexibility to participate in evaluations according to their needs. SonarQube was employed to detect code quality and suggest higher code standards [56].
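A three-dimension evaluation of this kind can be sketched as a weighted combination of correctness, originality, and quality. The sketch below is an illustration under stated assumptions and not the model from [56]: the weights, the per-issue penalty, and the use of `difflib.SequenceMatcher` as a plagiarism proxy are all choices made here, and `quality_issues` is a plain integer standing in for an issue count that a tool such as SonarQube might report (no SonarQube API is called).

```python
from difflib import SequenceMatcher


def originality(submission, others):
    """1 minus the highest text similarity against the other submissions."""
    if not others:
        return 1.0
    highest = max(SequenceMatcher(None, submission, o).ratio() for o in others)
    return 1.0 - highest


def evaluate(passed_tests, total_tests, submission, others, quality_issues,
             weights=(0.5, 0.3, 0.2), issue_penalty=0.1):
    """Combine the three dimensions into one score in the range 0..1.

    weights and issue_penalty are hypothetical tuning values.
    """
    correctness = passed_tests / total_tests if total_tests else 0.0
    orig = originality(submission, others)
    quality = max(0.0, 1.0 - issue_penalty * quality_issues)
    w_c, w_o, w_q = weights
    return w_c * correctness + w_o * orig + w_q * quality


# A submission that passes all tests, has no peers to compare, and no issues.
print(evaluate(10, 10, "print(1)", [], 0))
```

Character-level similarity is a crude proxy that renaming variables can defeat, so in practice a low originality value would more plausibly flag a submission for teacher review than reduce a grade automatically.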
12) WRENCH
Casanova et al. [57] devised a collection of pedagogical modules that could be periodically integrated into university courses to address these challenges. These courses incorporated simulation-driven exercises providing students with hands-on exposure to critical application and platform scenarios. Simulators, built using the WRENCH [84], [85] simulation framework, supported these endeavors. Following the introduction and outline of their approach, the study described and evaluated the outcomes based on assessments conducted during two consecutive iterations of an undergraduate university course.

Fauzan, Siahaan, Solekhah, Saputra, Bagaskara, & Karimi
Journal of Information Systems Engineering and Business Intelligence, 2023, 9 (2), 264-275

13) SEP-CyLE (Software Engineering and Programming Cyberlearning Environment)
SEP-CyLE employed a cognitive walkthrough coupled with a think-aloud methodology and a heuristic evaluation technique. The aim of the UI/UX evaluation was to enhance cyberlearning and deliver a cyberlearning environment tailored to specific users [58]. Subsequently, network-based analysis was employed to identify statistically significant correlations within the heuristic assessment survey, particularly regarding the perceptions of students in using SEP-CyLE. The application of this framework had some similarities with MOODLE [86], [87].

B. RQ2: What are the challenges of the assessment framework in a software engineering course?
In this section, the challenges associated with assessment frameworks in software engineering courses were explored.

1) Students have difficulty understanding the material, making it tedious to carry out an assessment
An existing assessment framework was identified as inappropriate for the task at hand, necessitating adjustments. Previous studies [50], [51] recognized a common issue: students struggled to grasp computer programming concepts. This challenge hindered the evaluation of students, as they often possessed a superficial understanding of the subject and lacked effective problem-solving skills. Zeid [39] addressed declining student interest in software engineering courses by introducing a competition. However, assessing performance in these competitions was quite challenging, which was why an assessment matrix was employed for evaluating students. Competency assessment [54] became a complex task due to vague definitions. Various solutions, including the use of game-based intermediaries [88], [89], were proposed to make the material more engaging.
2) Differences in educational practice make it difficult to find the proper assessment framework
For an extended period, educational practices centered around instructors had been considered the cornerstone of effective teaching. This created the perception that students primarily memorized material to pass examinations, as their focus remained solely on classroom lessons. Furthermore, due to the limited exposure to real-world scenarios, individuals often encountered difficulties when presented with practical challenges. In the context of Information Technology (IT), establishing a robust foundation in programming had been deemed essential. Simply memorizing programming concepts had proved inadequate; to attain a comprehensive understanding, students needed to actively engage in practical exercises. Although methodologies had often prioritized provisioning and operational efficiency, evaluating their applicability to specific real-world situations remained a complex task. This was the reason a framework facilitating the analysis of methodologies, industry practices, and problem-solving techniques became necessary. Assessment was conducted using the Problem-Based Learning (PBL) framework, evaluating both the process and the product [55].

3) Assessors must read and analyze student source code, which takes a long time
The assessment of student work in web page assignments presented challenges and consumed a significant amount of time. This process entailed the inspection and analysis of source code to ensure it adhered to the requirements. The page also had to be examined in the browser for necessary features, such as operational links [41]-[46]. Consequently, there was likely a reduction in the number of assignments given, and educators dedicated substantial time to grading, which limited their interactions with students and hindered other essential tasks.

4) Plagiarism on student assignments
In the field of computer science, assessing the computer skills of a large student population using computer-assisted tools proved to be intricate. Challenges included allegations of plagiarism and variations in overall difficulty level.

C. Implication and limitation
The geographical distribution of studies was emphasized, and the results showed that the majority of investigations originated in the USA. This was not surprising, considering the esteemed reputation of the country in the aspect of education, particularly within the fields of computer science and software engineering. On the other hand, studies addressing assessment frameworks in software engineering courses from Asian countries represented only a small fraction.
This study identified 13 assessment frameworks, including the Assessment Matrix, Automatic Assessment, CDIO, Cooperative Thinking, formative and summative assessment, Game, Generative Learning Robot, NIMSAD, SECAT, Self-assessment and Peer Assessment, SonarQube Tools, WRENCH, and SEP-CyLE. Alterations to the framework were considered necessary due to the recurring issues that were recorded.
Adjustments were implemented in the assessment framework to facilitate effective student evaluation. Some of the adjustments assisted in simplifying the process of evaluating student work.
This review had significant implications as technology advanced and the field of computer education evolved. This evolution presented challenges in assessing computer courses, necessitating extensive study to create appropriate assessment frameworks that would be in line with the learning system.
It should be noted that this systematic review encountered two primary constraints, namely potential bias in study selection and potential bias in data extraction. To mitigate these biases, a multistage technique was employed in which two investigators independently assessed the relevance of each study based on predetermined inclusion and exclusion criteria. Initially, searches were conducted using similar keywords in electronic databases, then papers were selected based on predefined rules, and each study result went through meticulous evaluation.

VI. CONCLUSIONS
In conclusion, this paper provided an overview of the literature concerning assessment frameworks in software engineering and computer science courses. An initial pool of 4,962 papers was obtained from reputable electronic databases, and 20 relevant papers were selected through a step-by-step, multistage selection process that went through independent validation at each stage. The examination of these papers revealed a recurring theme: the necessity for adjustments within the assessment framework. This need came from an inconsistency between the framework and learning activities in progress. Consequently, the task of accurately assessing student work presented a formidable challenge. Across the papers dedicated to assessment frameworks in software engineering and computer science courses, a total of 13 distinct frameworks were identified, each crafted to address these issues. This exploration shed light on several challenges inherent in the assessment framework of software engineering courses. The challenges comprised struggles faced by students in comprehending course materials, disparities in educational practices, the intricacies of detailed assessments, and concerns related to plagiarism in student assignments.
This study possessed substantial value in its endeavor to map the encountered problems and the corresponding solutions in the assessment framework, particularly within the domains of software engineering and computer science education. These insights could serve as foundational references for educational institutions aiming to establish assessment frameworks that fit seamlessly with the selected pedagogical techniques.

Fig. 1 A breakdown by year of selected studies

Fig. 3 Variations in study distributions based on the technique employed

TABLE 1 A NUMBER OF STUDIES WERE DISCERNED IN THE VARIOUS PHASES OF OUR SYSTEMATIC SEARCH