Introduction

Project-based learning (PBL) often culminates in student-created artifacts, such as videos, that demonstrate learning in an engaging way. With the advent of generative AI, students can now feed text (scripts or prompts) into AI tools to produce multimedia videos, introducing new opportunities and challenges in education. On one hand, AI-generated videos can be personalized and visually rich, potentially enhancing student engagement and creativity in PBL tasks. On the other hand, assessing the quality of these AI-assisted creations is not straightforward. Educators need reliable methods to evaluate qualitative aspects of student videos, such as their emotional tone, clarity, and accuracy, to ensure learning objectives are met. Traditionally, teachers or expert reviewers would perform this evaluation, but AI-powered analyzer tools promise a faster, more consistent alternative. This research proposal outlines a study comparing an AI analyzer tool with human evaluators in the qualitative analysis of student-created, text-to-video AI-generated projects. The goal is to determine how closely an AI's assessment of video content aligns with human judgment, and what this means for future classroom assessment practices.

Background and Literature Review

AI-Generated Video in Education

Generative AI technologies are increasingly influencing educational content creation. Recent studies show that AI-generated instructional videos can achieve learning outcomes comparable to traditional teacher-made videos. For example, Netland et al. (2024) found that learners acquired similar knowledge from AI-created teaching videos as from human-crafted ones, even though they preferred the human videos as a learning experience. These AI videos, produced by large language models and related tools, have been used successfully to convey concepts in various subjects. In a science teacher education context, Pellas et al. (2024) demonstrated that AI-generated videos boosted learners' self-efficacy and knowledge retention, positioning them as promising assets for instruction. AI systems can rapidly turn textual input into explainers or story-based videos; for instance, the StudyFetch Imagine Explainers tool can generate a 10–60 minute educational video from course materials or a given topic prompt. This ease of content creation suggests AI videos will proliferate in classrooms, making it important to study how to evaluate their quality.

Despite these advantages, researchers note some limitations of AI-driven videos. One concern is reduced social presence or human touch: students often report that AI-generated videos feel less engaging or relatable than instructor-led ones. Shahzad et al. (2024) observed that learners missed the connectedness of a human presenter, which could lead to distraction or lower emotional engagement. However, if the content itself is highly relevant and well crafted, students may overlook the lack of a human presenter. As generative video technology improves (e.g., more lifelike AI avatars), these engagement gaps may narrow. These findings highlight the need to assess not just the factual content of AI videos, but also qualitative factors like tone and engagement that influence learner reception.

Project-Based Learning and Student-Created AI Videos

PBL is a student-centered pedagogy in which learners gain knowledge by working over an extended period to investigate and respond to complex questions or challenges. A hallmark of PBL is the creation of tangible artifacts (presentations, portfolios, videos, etc.)
that showcase student learning in authentic contexts. Integrating AI into PBL can significantly enrich this process. Frydenberg (2025) argues that AI tools can be "key players" in open-ended projects by assisting with brainstorming, research, and content creation. Instead of replacing the learning process, AI can serve as a tool that students use to develop and refine their learning. For example, students might use an AI text-to-video generator (such as Canva's or VEED's AI video tools) to transform their written ideas into a dynamic video. This allows students to focus on crafting the narrative and verifying information, while the AI handles the technical production.

Early evidence suggests that allowing students to leverage AI in projects can maintain or even boost engagement and motivation. AI-assisted creation frees students to concentrate on higher-order thinking, such as analyzing and organizing content, since the AI automates lower-level tasks. The resulting AI-generated videos in PBL are often personalized and context-rich, making scientific or historical scenarios come alive. They can provide "dynamic scaffolding" for problem-solving, meaning the videos adapt or present information in a way that aids understanding of the problem at hand. For instance, an AI video might visualize a scientific experiment or animate a historical scene described in a student's script, helping the entire class engage with the material.

However, the use of AI also raises questions about how educators can assess these AI-crafted works. In PBL, assessment is typically authentic and multidimensional, focusing on both content mastery and skills like creativity or collaboration. When a video is partially AI-generated, teachers must discern the student's contribution (e.g., the quality of the script, the ideas, and the narrative) and the overall effectiveness of the video as a learning artifact. This is where an AI analyzer could assist by providing an initial analysis of the video's qualities, but its effectiveness relative to human judgment remains uncertain. Our study is situated at the intersection of these trends: it examines student-created AI videos in PBL and how they are evaluated, aiming to inform best practices as AI becomes a normal part of student work.

AI Analyzer Tools for Qualitative Evaluation

AI analyzer tools refer to software systems that use artificial intelligence (such as machine learning and natural language processing) to automatically evaluate content along various qualitative dimensions. Unlike purely quantitative analytics (e.g., viewing time or quiz scores), a qualitative AI analyzer attempts to judge aspects like the tone of speech, the coherence of a narrative, or the accuracy of information in a piece of content. Modern AI analyzers combine techniques: for video content, they often include speech-to-text transcription, NLP analysis of the transcript, computer vision for image/frame analysis, and sometimes audio analysis for voice tone. These tools have seen rapid development and are increasingly used in media and education. For instance, an AI video analysis platform can detect key scenes, track audience engagement patterns, and even generate content summaries. More advanced systems go further by assessing qualitative aspects, e.g.,
identifying whether a speaker's tone is positive or negative, or whether the presentation follows a logical structure.

In the context of student-created videos, an AI analyzer could perform several relevant checks:

- Emotional Tone Detection: Using sentiment analysis and facial expression recognition, AI can evaluate the emotional tone conveyed in the video. This might involve analyzing the narration or on-screen avatar's voice for positivity and enthusiasm versus monotony, and detecting facial cues if human or avatar presenters are visible. For example, sentiment analysis algorithms can classify speech transcripts as having a positive, negative, or neutral tone. Likewise, computer vision can recognize facial expressions corresponding to emotions like happiness, surprise, or confusion. Together, these AI features estimate the affective quality of the video, an important engagement factor. An AI analyzer might report that a video's overall tone is engaging and upbeat, or conversely that it comes across as flat or emotionally neutral.

- Content Accuracy Checking: AI tools can cross-reference the statements in a video against reliable knowledge bases to flag inaccuracies. For instance, if a student's AI-generated video claims a historical event date or a scientific fact, a fact-checking algorithm can compare that claim with databases or use a large language model with web access to verify it. Current AI fact-checkers use NLP to extract factual claims and search for corroborating or conflicting information. While not perfect, they can catch blatant errors. In a recent analysis, an AI fact-checking system identified correct versus incorrect factual statements with around 72% accuracy, compared with roughly 65% for GPT-4 on the same task. This indicates AI can assist in evaluating content accuracy, although human oversight is still often needed for subtle or context-dependent facts. In our study, the AI analyzer is expected to flag any factual inconsistencies or probable errors in the students' video content, contributing to an "accuracy" score for each video.

- Clarity and Coherence Analysis: Natural language processing models, especially large language models (LLMs), have shown the ability to assess the clarity and coherence of text. When provided with the transcript of a video, an AI analyzer can evaluate whether the ideas are presented in a logical order and whether the narrative is easy to follow. Researchers have developed AI metrics and LLM-based evaluators that align closely with human judgments of discourse coherence. Notably, a 2023 study used GPT-4 to rate the coherence of written text and found its ratings were highly comparable to those of expert human raters. Such AI can highlight whether a video's script jumps disjointedly between topics or maintains a clear focus throughout. Additionally, clarity (how clearly concepts are explained) can be gauged by analyzing sentence structure and vocabulary level. If the script is overly jargon-laden or ambiguous, AI language models can detect that by comparing it against expected explanations. In essence, the AI analyzer will use NLP techniques to judge whether the video's message flow is well structured and clear for the intended audience.
- Engagement and Presentation Quality: Some AI video analysis tools track indicators of viewer engagement or presentation quality. For example, machine learning can assess speaking clarity and pacing (was the narration too fast or mumbled?), detect visual variety (presence of images, scene changes), and even predict audience interest by analyzing content features. Prior work on automated grading of presentations shows that algorithms can assess criteria like clarity of speech and level of engagement based on vocal dynamics and visual cues. Our AI analyzer will examine the student videos for elements such as monotony versus dynamism in presentation, use of multimedia elements, and overall production values (stability of visuals, sound quality, etc.). While some of these factors are technical, they contribute to how engaging a video is. The AI might, for instance, gauge engagement from the variety of visuals and the emotional tone (under the assumption that an enthusiastic tone is more engaging).

- Creativity Indicators: Evaluating creativity is inherently challenging for both humans and AI. However, AI analyzers might look for proxy indicators: unusual or original combinations of media, novel storytelling approaches, or unique perspectives in the content. One possible approach is using an LLM to judge the novelty of the video's content compared to typical answers on the topic. Another is image analysis to see whether the visuals used are generic stock images or something more inventive. Although there is no straightforward metric for creativity, the AI's qualitative report might include observations on elements that seem particularly original or, conversely, very formulaic. (For example, an AI might note whether the video's script closely mimics a Wikipedia article or tells a story in the student's own voice.)

By combining these analyses, an AI analyzer tool produces a profile of each video across multiple qualitative dimensions. For this study, we will either use an existing integrated AI platform or develop a custom pipeline that includes automated transcription, sentiment analysis for tone, an LLM-based coherence and clarity rater, a fact-check API for accuracy, and detection of engagement-related features. The output will be a set of qualitative scores or descriptors for each video (e.g., tone: positive; coherence: high; accuracy: 8/10; engagement: moderate; creativity: noted in use of examples). It is important to note that AI analysis, while fast and consistent, has known limitations. AI might struggle with context or nuance, for example understanding sarcasm or the educational appropriateness of content. Additionally, AI evaluations lack the lived experience and pedagogical insight a human teacher brings. These caveats will be kept in mind when comparing AI and human evaluations.
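To make the intended output format concrete, the following is a minimal sketch of how such a per-video report could be represented if we implement the custom pipeline in Python. The field names and value types are illustrative assumptions, not a finalized schema.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class VideoQualityReport:
        """Hypothetical structured output of the AI analyzer for one student video."""
        video_id: str
        tone: str                          # e.g., "positive", "neutral", "enthusiastic"
        accuracy: float                    # e.g., 8/10 expressed as 0.8
        clarity: int                       # 1-5 rubric-style score
        coherence: int                     # 1-5 rubric-style score
        engagement: str                    # "low" / "moderate" / "high"
        creativity_notes: Optional[str] = None                   # free-text observations, if any
        flagged_claims: List[str] = field(default_factory=list)  # suspected factual errors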
Human-Based Evaluation of Student Videos

Human evaluation in this context refers to expert qualitative assessment by educators or trained raters who manually watch each student-created video and judge its quality. This is the de facto standard in education: teachers often use detailed rubrics to assess student projects, ensuring they consider multiple criteria and maintain consistency. A typical video project rubric, for instance, will include criteria such as content accuracy and depth, clarity of communication, creativity/originality, and technical execution (sound, visuals). Each criterion might be rated on a scale (e.g., 1 to 4 or 1 to 5) with descriptors for what constitutes "Excellent" versus "Needs Improvement" performance. For example, a rubric may define Content as excellent if it covers the topic in depth with no factual errors, and Originality as high if the video shows creative, witty ideas rather than material simply copied from sources. Educators use such rubrics to guide their evaluations, providing both numeric scores and written feedback.

In our study, the human-based evaluation will be conducted by educator reviewers (such as teachers or experienced instructional designers) who are knowledgeable about the project's learning goals. We will recruit multiple evaluators to allow measurement of inter-rater reliability. Each evaluator will review the student videos independently and rate them on the predefined qualitative aspects (detailed in the next section). They will also provide brief qualitative comments to justify their ratings or to highlight strengths and weaknesses of each video. This manual evaluation process is time-consuming, but it benefits from human abilities to understand context, nuance, and student voice. Humans can pick up on subtle cues (for example, recognizing a student's humor or appreciating a particularly innovative approach) that an AI might miss or misidentify. Moreover, teachers can judge appropriateness (whether the video's content is suitable for the assignment and audience) and cross-reference it with what was taught, something that would be hard for a generic AI tool without specific training.

Human evaluation is not without its own challenges. There can be variability in judgments: what one teacher finds engaging, another might not. Personal biases and expectations can influence scores (one reviewer might be stricter about factual accuracy, another might place more weight on creativity). However, using a common rubric and training the raters on it can mitigate some inconsistencies. Compared to AI, humans are better at explaining their reasoning in natural language and at adapting their criteria when unusual cases arise. For instance, if a video intentionally breaks a conventional rule to achieve a creative effect, a human can recognize and credit that, whereas an AI might simply mark it as a deviation. Human evaluation is considered the benchmark in our study: we treat the aggregate human judgment as the standard against which the AI analyzer's results will be compared. Ultimately, the goal is not to prove that one is "better" outright, but to see in which areas they converge or diverge. This can illuminate whether AI tools can augment the teacher's role in assessment or even handle some aspects autonomously under supervision.

Research Objectives and Questions

The primary objective of this research is to compare the qualitative evaluations produced by an AI analyzer tool with those produced by human evaluators for student-created AI-generated videos. We seek to determine the extent to which the AI tool can replicate or approximate human judgment on complex qualitative criteria, and to identify areas of alignment and discrepancy. The findings will inform how AI might be integrated into educational assessment processes for project-based work. Key research questions include:

- How do AI-generated evaluations correlate with human evaluations for each qualitative aspect of student videos? We will examine metrics like emotional tone, content accuracy, clarity, coherence, engagement, and creativity, comparing the AI's rating or description to the human raters' assessments.

- In what areas do the AI analyzer and human evaluators most strongly agree or disagree?
For example, does the AI tool closely match human judgments on factual accuracy but differ in gauging creativity or emotional impact? Identifying these patterns will highlight the strengths and limitations of the AI approach.

- What is the inter-rater reliability between human evaluators, and how does the AI's "judgment" align within that variability? This asks whether the AI's ratings fall within the range of what human experts consider acceptable (even if not identical to the average) or whether they consistently deviate.

- What are the practical implications of using an AI analyzer for assessment in terms of efficiency, feedback quality, and feasibility? While not a purely qualitative comparison question, the study will document observations on the time taken for AI versus human evaluation, and the nature of the feedback produced, to discuss how the findings could influence classroom practice.

By addressing these questions, the research will provide a nuanced understanding of whether AI tools can support or complement human evaluators in assessing student-created AI videos. The comparison will consider not only scores but also qualitative insights from each source, thereby giving a comprehensive picture of evaluative consistency and value.

Methodology

Research Design Overview

We will use a comparative mixed-methods design involving parallel evaluations of the same set of student-created videos by an AI tool and by human raters. The study is primarily evaluative and correlational: we are comparing two modes of qualitative analysis rather than manipulating a treatment (all student participants experience the same PBL activity without intervention). The independent variable is the type of evaluator (AI analyzer vs. human), and the dependent variables are the evaluative outcomes on various qualitative criteria (scores or ratings on tone, accuracy, etc., as well as narrative feedback). The design includes both quantitative components (e.g., scoring and statistical comparison of AI vs. human ratings) and qualitative components (analysis of comments and feedback, and case studies of disagreements).

We will ensure that the AI evaluation and the human evaluations are conducted separately and blind to each other: human raters will not see the AI's assessment of a video, and the AI will of course not have access to human scores. This avoids any cross-influence. After both sets of evaluations are complete, we will analyze the results for each video and each criterion.

Because multiple human evaluators will be used, we will also measure their inter-rater reliability. This provides a baseline of how much agreement can be expected even among humans. If the AI's ratings fall within the consensus range of the human raters, that will be considered a form of agreement. We will treat the combined human evaluation (e.g., the average of human ratings or a consensual qualitative summary) as the "ground truth" reference, while acknowledging that it has its own margin of error. The study will be conducted in an authentic educational setting (e.g., within a course or a set of PBL workshops) to ensure the ecological validity of the student-created videos and the evaluation context.

Participants and Context

Student Participants: The subjects of this study will be students engaged in a project-based learning unit that involves creating an AI-generated video. We plan to recruit approximately 30–50 students from a high school or undergraduate program, depending on availability and class sizes.
These students will work in small teams (2–4 per team) to produce their videos, resulting in an estimated 15–20 videos for analysis. All students will receive the same project brief (adapted to their curriculum). For instance, a possible project prompt is: "Create a 3–5 minute educational video explaining a scientific concept or historical event, using an AI tool to generate the video from your written script." This ensures some consistency in video length and purpose, while allowing creative freedom. The participants will be drawn from classes where multimedia projects are appropriate (e.g., a science class explaining scientific phenomena, or a social studies class presenting historical narratives). We will ensure diversity in the sample in terms of student backgrounds and topic choices, to see how the evaluations perform across varied content. All participation will be voluntary and with appropriate consent/assent; if the project is part of class work, alternative options will be given for students who do not wish to be in the study, so that grading is not tied to research participation.

Human Evaluator Participants: We will involve 3–5 human evaluators with expertise in education and multimedia projects. Ideally, this panel will include the course instructor(s) and additional independent educators (for example, a teacher from another school, or a graduate student in education research) to reduce potential bias. These individuals will be trained on the evaluation rubric (discussed below) prior to reviewing the videos. They will either not be the students' regular graders for the course or will perform the evaluation after course grades have been assigned, to avoid conflicts of interest. Their task is purely to assess the videos for research purposes. Having multiple evaluators allows us to measure how consistently humans apply the rubric and to use an aggregated human judgment as a robust comparator for the AI tool's output.

Context: The study will be carried out in the context of a project-based learning assignment over the course of a school term or semester. Students will have several weeks to research their topic, write a script or outline, and then use an AI video generation tool to produce the final video. We expect them to use text-to-video generators such as Canva's AI video tools, Synthesia, Powtoon with AI, or similar platforms that can turn a written narrative into an animated or recorded video. The choice of tool may be left to students (with recommendations provided) to mimic real-world usage; however, all teams will produce a digital video file that can be analyzed. We will document which tool each team uses and any notable differences (e.g., some might use an AI avatar presenter, others an animated storyboard style), as that could slightly affect how tone is conveyed (avatar voice vs. purely text animations). Class time (or project time) will be given for students to learn the AI tool and create their videos. We will also collect students' written inputs (scripts/storyboards) as ancillary data. After submission, the videos will be compiled for evaluation.

Materials and Instruments

AI Video Generation Tools: Students will be provided access to at least one text-fed AI video generator. One likely choice is Synthesia, which allows users to input a text script and select an AI avatar to narrate, producing a video of the avatar speaking the script. For consistency, we may choose one platform for all students or allow a small selection; the decision will be documented in the methodology.
The key point is that all videos are student-created via AI: students supply the content (text and design choices) and the AI renders the final video.

AI Analyzer Tool: The core instrument on the AI side is the AI analyzer system that will evaluate each video. We considered two approaches: (a) use a commercial or open-source AI video analysis platform, or (b) implement a custom analysis pipeline using existing AI models. After surveying available tools, we have opted for the custom pipeline approach in order to fully target the specific qualitative criteria. The pipeline will operate as follows:

1. Use an automated speech-to-text service (such as Google Speech-to-Text or Whisper) to get an accurate transcript of each video's audio narration. (If the generation platform already provides the transcript, we will use that after verification.)

2. Feed the transcript into a large language model (LLM) like GPT-4 (through its API) with a carefully crafted prompt to evaluate the content. The prompt will ask the LLM to analyze the text on the dimensions of clarity, coherence, factual accuracy, and creativity. For example: "Analyze the following video transcript. Provide an assessment of: (1) clarity of explanation, (2) logical coherence of the content flow, (3) any factual errors or inaccuracies, (4) the creativity or originality of the presentation. Use a scale from 1 (poor) to 5 (excellent) for each, with a brief justification." Recent research indicates GPT-4 can produce coherence evaluations consistent with experts, so we expect it to handle such prompts well. The LLM's response will be parsed to extract scores and comments for Clarity, Coherence, Accuracy, and Creativity.

3. Apply tone and emotion analysis to both the transcript and the audio. For the transcript, we will use sentiment analysis (e.g., VADER or a transformer-based sentiment classifier) to gauge the overall emotional tone of the text (looking for positivity, negativity, enthusiasm, etc.). For the audio (the properties of the voice in the video), if available, we will analyze speech prosody. Since AI voices are used, the variation in tone might not be large, but we can still measure speaking rate, volume modulation, and any emotional indicators the AI voice carries. Additionally, if the video includes an AI avatar with facial expressions, we could use a vision model to detect facial emotion, though in many AI videos the avatar's expression may be relatively neutral. A tool like ScreenApp's emotion detector illustrates how facial, textual, and vocal cues can be combined to identify emotions. Our AI analyzer will produce a descriptor for Emotional Tone (e.g., "informative and neutral" or "enthusiastic and friendly") and possibly an engagement proxy score derived from it (as tone contributes to engagement).

4. Gather metadata on engagement features. While we cannot measure actual viewer engagement without an audience, we will have the AI evaluate the video's engagement potential. This includes checking the length of the video (is it likely too long for the audience's attention span?), the presence of interactive or visually interesting elements (which we might identify by key-frame analysis or the variety of imagery), and the pacing. Some of these can be inferred from the transcript (if the content is very dense or monotonous) and some from the video file (frequency of scene cuts or slide changes). If using computer vision, we could detect scene transitions or on-screen text usage. For simplicity, the LLM's prompt might also ask "Does the video maintain viewer interest? How engaging are the tone and content (low, medium, high)?" to obtain an engagement rating. Engagement will thus be a somewhat holistic judgment by the AI, combining the tone, clarity, and media variety it can observe.

The AI analyzer tool will output a structured report for each video, containing scores or ratings for each of the six qualitative criteria (Tone, Accuracy, Clarity, Coherence, Engagement, Creativity) along with explanatory notes. These results will be saved for analysis. A minimal code sketch of the pipeline's core steps appears below.
Human Evaluation Rubric: We will develop a rubric for the human evaluators that mirrors the criteria the AI is judging, to enable direct comparison. The rubric will have the following six criteria, each defined clearly for the raters:

- Emotional Tone and Delivery: Is the video's tone appropriate and engaging? Raters will consider the narrator's enthusiasm or expressiveness, and whether the tone suits the content (e.g., upbeat for a friendly explainer, serious for a serious topic). They will mark this on a scale (1 = very flat or inappropriate tone, 5 = extremely engaging and fitting tone) and can note examples of effective or ineffective tone. This criterion includes the delivery style of the content.

- Content Accuracy and Depth: Are the facts and information in the video accurate and sufficiently detailed? Raters will verify key information against their own knowledge or common references. They will also judge whether the content demonstrates understanding (depth) or is superficial. Scoring: 1 = many inaccuracies or misconceptions, 5 = completely accurate with excellent detail. We will instruct raters to note any specific errors they find (to compare with the AI's findings).

- Clarity of Explanation: How clear and understandable is the video's message? This covers the clarity of speech (or on-screen text) and the clarity of the ideas presented. A high score means the video is easy to follow and points are explained well for the intended audience; a low score might indicate confusing wording or assumptions that the viewer knows more than is explained.

- Coherence and Organization: Does the video have a logical structure and flow? Raters will assess whether the introduction, body, and conclusion (if applicable) of the video connect logically. Does each part follow from the previous one? Are there clear transitions? Scoring: 1 = very disorganized or jumps around, 5 = excellent logical flow with a well-structured narrative.

- Engagement and Visual/Audio Design: How engaging is the video overall, considering its visuals and pacing? Here the human rater considers whether the video kept their interest. They will note the use of visuals, graphics, or examples that made it interesting, and whether the pacing was appropriate (not so slow as to bore, nor so fast as to overwhelm). Essentially, this is the viewer engagement criterion. Scoring: 1 = not engaging (hard to stay focused), 5 = very engaging and enjoyable to watch.

- Creativity and Originality: To what extent does the video demonstrate creativity in its approach? Raters look for originality in the content or format. Did the student team take an innovative angle, use a clever analogy, or present the information in a unique way? Even within the constraints of using an AI generator, students can show creativity in their script or choice of visuals. Scoring: 1 = very conventional or copied feeling, 5 = highly original and creative.

Each criterion will have a brief descriptor on the rubric, and space for the rater's short comment. We will pilot this rubric with a couple of sample videos (possibly not from the main study, or from previous class projects) to ensure the criteria are clear and the scoring is consistent.

Data Recording Instruments: For human ratings, we will use a Google Form or similar online form where each evaluator can submit their ratings and comments for each video; responses will populate a spreadsheet. For AI results, the outputs will be collected in a structured format (JSON or CSV) with the scores and text. We will maintain a master dataset that links each video ID with the human scores (from each rater) and the AI scores, as sketched below.
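As an illustration only, assuming the ratings are exported to CSV files with the column names shown, the master dataset could be assembled with pandas roughly as follows; the file and column names are hypothetical.

    import pandas as pd

    # Hypothetical exports: one row per rating.
    human = pd.read_csv("human_ratings.csv")   # columns: video_id, rater_id, criterion, score
    ai = pd.read_csv("ai_scores.csv")          # columns: video_id, criterion, ai_score

    # Average the human raters per video and criterion, keeping the spread for reliability checks.
    human_summary = (
        human.groupby(["video_id", "criterion"])["score"]
             .agg(human_mean="mean", human_min="min", human_max="max")
             .reset_index()
    )

    # One row per video and criterion, with human and AI scores side by side.
    master = human_summary.merge(ai, on=["video_id", "criterion"], how="left")
    master.to_csv("master_dataset.csv", index=False)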
Procedure

1. Project Execution (Student Phase): During the first phase of the study, students will execute their PBL assignment. They will research their chosen topic, compose their video script, and use the designated AI tool to generate the video. We (the researchers) will observe this process to note how the AI tool is used, without intervening in content creation. This observation is mainly to contextualize the outcomes: for example, if a team struggles with the AI voice settings, that might affect the tone. Once the videos are completed (by the project deadline), students will submit the video files (and, optionally, the script if it is separate). We will verify that videos meet basic requirements (length, audibility) and then label each video with an anonymized ID. At this point, any direct student involvement is done; the rest is evaluation.

2. AI Analyzer Evaluation (AI Phase): We will run each submitted video through the AI analyzer pipeline. This will be automated as much as possible to ensure consistency. For each video, we will extract or receive the transcript (checking accuracy by comparing a snippet of the transcript to the audio), feed the transcript and other data into the analysis algorithms/LLM as described (if using GPT-4, this will be done programmatically via its API for each video in a loop), and save the AI's resulting evaluation. If the AI returns narrative text, researchers will parse it for the key points. For reliability, we might run the analysis prompt twice at different times and check whether the results are stable; if not, we may average the results or refine the prompt. This yields an AI-generated evaluation for every video on all criteria. This step should take relatively little time (a few minutes per video at most, so for 20 videos perhaps a couple of hours in total including overhead). A brief sketch of this batch step follows.
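A minimal sketch of that loop, reusing the hypothetical analyze_video helper from the Materials and Instruments section and assuming the numeric criteria come back as 1-5 scores; the stability tolerance and file layout are illustrative assumptions.

    import glob

    NUMERIC_KEYS = ["clarity", "coherence", "accuracy", "creativity"]
    results = {}

    for path in glob.glob("videos/*.mp4"):
        # Run the analysis prompt twice to check stability, then average the numeric scores.
        first, second = analyze_video(path), analyze_video(path)
        stable = all(abs(first[k] - second[k]) <= 1 for k in NUMERIC_KEYS)
        averaged = {k: (first[k] + second[k]) / 2 for k in NUMERIC_KEYS}
        results[path] = {**averaged, "tone": first["tone"], "stable": stable}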
3. Human Evaluation (Rater Phase): Separately, the human evaluators will be given access to the videos (for example, via a private YouTube playlist or shared drive). Each evaluator will be blind to which team or class each video came from, and they will work independently. We will give them the rubric and perhaps a calibration session: for instance, all evaluators might first watch one example video (outside the main dataset) together and discuss scores to align their understanding. They will then rate the actual student videos on their own, filling out the evaluation form for each video and providing scores of 1–5 and comments for each qualitative aspect. We will encourage them to be thorough in their comments, especially if a score is very high or low. The evaluators will not know anything about any AI analysis that was done; they are assessing purely as they normally would a student project. This phase may take several days to a week: each video (3–5 minutes long) plus writing feedback could take about 10 minutes, so 20 videos would require roughly 200 minutes (about 3.3 hours) of viewing plus time to record judgments. With breaks, each rater might spend around 5–6 hours in total, spread over multiple sessions. We will collect all human rating forms. Afterward, we (the researchers) will compile the human data: we may calculate an average score for each criterion per video among the raters, and note the range or any disagreements. If two raters differ greatly on a video, we might have a follow-up discussion with them (if feasible) or note it as variability.

4. Data Analysis (Analysis Phase): Once both AI and human evaluations are in, we will conduct the analysis as outlined in the next section (statistical comparison, etc.). This effectively closes the data collection procedure.

Throughout the procedure, ethical considerations such as privacy and fairness will be maintained. Videos will be stored securely and accessed only by the research team and evaluators. Any potentially identifying student information in videos (faces, names mentioned) will be noted: since videos are AI-generated from text, students will likely not be on camera themselves (an AI avatar might be), but if they do appear, we will ensure evaluators treat the material confidentially and that it is not shared. Upon completion, we will debrief participants (students and teachers) on the findings in summary form, so they can also learn from the process.

Data Analysis Plan

After gathering the evaluation data from both sources, we will perform several analyses to answer our research questions:

1. Descriptive Statistics: We will start by computing descriptive statistics for each qualitative criterion under each evaluation method. For each aspect (tone, accuracy, clarity, coherence, engagement, creativity), we will calculate the mean and standard deviation of the human scores (averaging across raters per video, then across videos for an overall mean) and the mean and standard deviation of the AI analyzer scores for the set of videos. This will give an initial sense of whether the AI tends to score higher, lower, or similarly to humans on average. For example, we might find the AI often gives higher clarity scores than humans do, indicating possible leniency or insensitivity to certain issues.

2. Correlation Analysis: To see whether the AI and human evaluations align, we will compute a correlation coefficient (e.g., Pearson r or Spearman rho) between the AI's score and the human average score for each criterion across the videos. A high correlation (close to 1.0) would mean the AI and humans rank the videos similarly on that aspect; a low or near-zero correlation would mean the AI's judgments are essentially unrelated to human judgments. For example, we will obtain a correlation for Tone_AI vs. Tone_human, Accuracy_AI vs. Accuracy_human, and so on. We will also look at an overall correlation if we combine all criteria into an overall quality score (though combining may not be straightforward, since the criteria measure different things).

3. Agreement and Reliability: Beyond correlation, we will assess agreement more directly. For each video and criterion, we will check whether the AI's rating falls within the range of the human ratings. We can use measures like Cohen's kappa or the intraclass correlation coefficient (ICC) if we treat the AI as another rater. Specifically, treating each video's evaluation as an item, we have multiple human raters and one AI "rater"; we can compute an ICC for each criterion including all raters (human and AI) and compare it to the ICC among just the humans. Another, simpler approach is to calculate how often the AI's score is within ±1 of the human average on the 5-point scale, as a percentage agreement (a combined code sketch of these comparisons appears after this list).
We will also examine any systematic bias: do the AI scores consistently skew higher or lower? A paired t-test can be conducted for each criterion, comparing AI and human mean scores per video, to see if there is a significant difference in magnitude. For instance, suppose humans rate engagement at 3.5 on average while the AI rates the same videos at 4.2 on average; a t-test might show a significant difference, implying the AI tool has an optimistic bias for engagement. We will report such findings.

4. Qualitative Thematic Analysis of Comments: We will analyze the qualitative feedback comments from both the AI and the humans. Using thematic analysis, we will identify common themes or points raised for each criterion. For example, under Clarity, human comments might frequently mention "jargon was not explained" for certain videos, while the AI comments might say "the explanation could be clearer." We will look at whether the AI picks up the same issues as the humans do. If the AI generates narrative explanations (via the LLM) for each score, we will compare those explanations to the human-written feedback. This part is qualitative, and we might highlight a few cases as examples:

- A case where AI and humans agree: e.g., Video X: both the AI and the humans noted that the video had an enthusiastic tone and clear structure but contained a factual error about the timeline of an event. This shows convergence in evaluation.

- A case where the AI missed something humans saw: e.g., Video Y: humans praised the creativity (perhaps the video used a novel metaphor), but the AI did not mention it and gave a middling creativity score. We would analyze why; possibly the AI did not recognize the metaphor as creative.

- A case where the AI flagged an issue humans did not: e.g., Video Z: the AI analyzer might label the tone as "negative" because of certain words, while the humans felt the tone was fine. This could indicate a misinterpretation by the sentiment analysis.

We will code human and AI comments for each aspect into categories of feedback (such as "good structure," "minor factual mistake," or "monotone delivery") and then examine the overlap versus the points unique to the AI.

5. Inter-Rater Reliability (Human): We will calculate the agreement among the human evaluators for context. Using the human ratings, we will compute either Cronbach's alpha or an ICC across raters for each criterion. If human agreement is high (say, ICC > 0.8) for clarity but low (ICC around 0.5) for creativity, that tells us creativity is inherently a more subjective aspect even among people. This context helps in interpreting the AI alignment: expecting the AI to perfectly match humans on an aspect on which humans themselves differ would be unrealistic.

6. Statistical Significance: For any numeric comparisons (correlations, differences), we will consider statistical significance (p-values) given our sample size of roughly 20 videos. We anticipate using non-parametric tests (Spearman, Wilcoxon) if the distribution of scores is not normal or if our sample is on the smaller side. However, the main point is the magnitude of agreement or disagreement, which we will illustrate with graphs. For example, we might present a scatter plot per criterion with human scores on one axis and AI scores on the other to visually inspect alignment.

7. Triangulation of Results: Finally, we will triangulate the quantitative and qualitative findings. If, say, accuracy shows high agreement quantitatively, and qualitatively we see the AI caught most factual errors similarly to humans, we conclude the AI is effective in assessing accuracy. If engagement shows low correlation and we find the AI's basis for engagement (such as tone or a count of visuals) does not match what humans felt, we conclude this is an aspect where AI currently falls short.

The analysis will result in a comprehensive comparison, identifying for each qualitative criterion whether the AI analyzer is comparable to human assessment (and could be used reliably) or divergent (and thus not yet dependable without human oversight).
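For analyses 2, 3, and 6 above, the following is a minimal sketch of how the comparisons could be computed in Python, assuming the master_dataset.csv layout sketched earlier plus a long-format file of all raters; pingouin is one possible library choice for the ICC, and the file and column names are illustrative assumptions.

    import pandas as pd
    import pingouin as pg
    from scipy import stats

    master = pd.read_csv("master_dataset.csv")          # video_id, criterion, human_mean, ..., ai_score
    crit = master[master["criterion"] == "clarity"]     # one criterion at a time

    # Rank-order agreement and a paired non-parametric test for systematic bias.
    rho, p_rho = stats.spearmanr(crit["ai_score"], crit["human_mean"])
    w, p_w = stats.wilcoxon(crit["ai_score"], crit["human_mean"])

    # Share of videos where the AI is within +/-1 of the human average on the 5-point scale.
    within_one = (abs(crit["ai_score"] - crit["human_mean"]) <= 1).mean()

    # ICC treating the AI as one more rater: long format of (video, rater, score), humans plus "AI".
    long_scores = pd.read_csv("all_raters_long.csv")    # columns: video_id, rater, score
    icc = pg.intraclass_corr(data=long_scores, targets="video_id", raters="rater", ratings="score")

    print(f"Spearman rho={rho:.2f} (p={p_rho:.3f}); within +/-1 agreement: {within_one:.0%}")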
Evaluation Criteria (Qualitative Aspects) to be Compared

For clarity, here are the qualitative aspects we will compare between the AI analyzer tool and the human-based evaluation, along with their definitions as used in this study:

- Emotional Tone: The affective character of the video's narration and content. Is the tone enthusiastic, neutral, persuasive, humorous, somber, etc., and is it appropriate for the subject matter? A positive, inviting tone can improve viewer engagement, whereas a flat or mismatched tone can hinder it. Example: an AI might detect a predominantly positive tone if the language is upbeat, while a human might describe the tone as "energetic and student-friendly." We compare such judgments.

- Content Accuracy: The correctness of the information presented in the video. Are all factual statements true and up to date? This also covers the absence of misconceptions and the presence of necessary details to substantiate claims. Example: the AI fact-checker might flag a date that does not align with known history, and a human evaluator might likewise mark a point off for that same error. Conversely, if the AI misses a subtle inaccuracy (or hallucinates one), we note that difference.

- Clarity: How clearly the video communicates its message. This includes clarity of speech (diction, audibility) and clarity of explanation (no unexplained jargon, clear definitions, and examples). It addresses whether the intended audience can easily understand the content. Example: a human might say, "The explanation of the experiment was very clear and easy to follow," giving a high clarity score. We expect the AI, via NLP analysis, to also rate clarity highly if sentences are well structured and the vocabulary is appropriate.

- Coherence: The logical flow and organization of content in the video. Does it progress in a sensible order (introduction, supporting points, conclusion)? Do ideas connect well? Example: if a video jumps between topics, human evaluators will note its poor coherence. An AI coherence metric or LLM analysis can similarly detect whether the text seems disjointed or out of logical sequence. We will compare notes on whether any video is labeled "disorganized" by humans and whether the AI agrees.

- Engagement: The ability of the video to maintain interest and engage the viewer. This is somewhat subjective, considering factors such as how captivating the content and presentation are, whether it evokes curiosity or an emotional response, and whether it is well paced. Example: humans might judge engagement by their own reaction (did the video keep them interested, or was it boring?), while the AI might infer engagement from proxies like variation in tone and visuals. We will compare the AI's engagement rating to the average engagement score from the humans to see how they align.

- Creativity: The originality and inventiveness of the video. Does it present the content in a unique way, or does it simply restate textbook information? Creativity can manifest in the narrative (e.g., framing a lesson as a story or game) or in visual metaphors, examples, and style.
Example: a human might give high creativity points if a student used a clever analogy or had the AI avatar role-play a scenario. The AI's ability to judge creativity is unproven, but we will see whether the LLM remarks on anything "novel" or whether certain unusual elements lead it to give a higher score. It is possible the AI will be more conservative here (e.g., many videos might receive a middling score unless something obviously distinctive, like a poem or skit, was done).

These six aspects form the core of our comparison. By defining them explicitly, we ensure that both the AI analysis and the human rubric target the same constructs. During analysis, we will maintain this alignment (e.g., if the AI provides a single "structure" score, we map it to our coherence criterion). The study's outcomes will be discussed in terms of these aspects, highlighting, for example, that "tone analysis by AI matched human perception in X% of cases, but creativity was a point of divergence," to give educators insight into which qualitative dimensions of student work AI can assess reliably.