NII Repository (National Institute of Informatics)
    2035 research outputs found

    PanguIR Technical Report for NTCIR-18 AEOLLM Task

    As large language models (LLMs) gain widespread attention in both academia and industry, evaluating their capabilities effectively becomes increasingly critical and challenging. Existing evaluation methods fall into two broad types: manual evaluation and automatic evaluation. Manual evaluation, while comprehensive, is costly and resource-intensive. Automatic evaluation scales better but is constrained by its evaluation criteria, which are dominated by reference-based answers. To address these challenges, NTCIR-18 (https://research.nii.ac.jp/ntcir/ntcir-18/tasks.html#AEOLLM) introduced the AEOLLM (Automatic Evaluation of LLMs) task, which aims to encourage reference-free evaluation methods that overcome the limitations of existing approaches. To improve performance on the AEOLLM task, we propose three methods for reference-free evaluation: 1) Multi-model Collaboration: leveraging multiple LLMs to approximate human ratings across various subtasks; 2) Prompt Auto-optimization: using LLMs to iteratively refine the initial task prompts based on evaluation feedback from training samples; and 3) In-context Learning (ICL) Optimization: training a specialized in-context example retrieval model on multi-task evaluation feedback and combining it with a semantic relevance retrieval model to jointly identify the most effective in-context learning examples. Experiments conducted on the final dataset demonstrate that our approach achieves superior performance on the AEOLLM task.
    conference paper
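
The abstract does not say how the outputs of multiple LLMs are combined. As a minimal sketch only, assuming each model returns one numeric rating per answer and that an unweighted mean is an acceptable combination rule, the idea of multi-model collaboration could look like this. The function name `ensemble_scores` and the model names are hypothetical and do not come from the paper.

```python
from statistics import mean
from typing import Dict, List


def ensemble_scores(ratings_by_model: Dict[str, List[float]]) -> List[float]:
    """Average the per-answer ratings given by several evaluator models.

    Assumption: every model rated the same answers in the same order.
    The unweighted mean is an illustrative choice, not the paper's method.
    """
    columns = list(ratings_by_model.values())
    if not columns or not columns[0] or len({len(c) for c in columns}) != 1:
        raise ValueError("every model must rate the same, non-empty set of answers")
    # zip(*columns) groups the ratings that different models gave to one answer.
    return [mean(sample) for sample in zip(*columns)]


# Hypothetical ratings from three evaluator models for two answers.
ratings = {
    "model_a": [4.0, 2.0],
    "model_b": [5.0, 2.0],
    "model_c": [3.0, 5.0],
}
combined = ensemble_scores(ratings)
print(combined)
```

A weighted mean or a learned combiner fitted to human ratings would be natural alternatives; the abstract leaves this choice open.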

    ASUKAI89 at NTCIR 18 RadNLP Task: Lung Cancer Staging Automatic Classification System Utilizing Large Language Models and Meta-Prompting

    This study aims to develop and evaluate a system that automatically extracts the TNM classification of lung cancer (T: primary tumor, N: lymph node metastasis, M: distant metastasis) from radiological diagnosis reports. In the initial experiments, inference was performed using `gemini-2.0-flash-thinking-exp-1219`. By incorporating explicit TNM classification criteria and unit specifications (features absent in conventional methods) and by introducing error analysis and prompt improvements through meta-prompting, overall accuracy improved by approximately 15% after prompt modification. In the final evaluation, using the `o1 2024-12-01-preview` model, we achieved approximately 70% joint accuracy (fine), 76% T accuracy, 93% N accuracy, and 95% M accuracy. This paper provides a detailed account of the experimental procedures and the improvement process at each stage.
    conference paper
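
The relationship between the per-component accuracies and the joint accuracy reported above can be illustrated with a small sketch. It assumes that a report counts toward joint accuracy only when its T, N, and M labels are all correct; the official task definition may differ in detail. The function name and the example labels are hypothetical.

```python
from typing import Dict, List

Label = Dict[str, str]  # e.g. {"T": "T2a", "N": "N0", "M": "M0"}
COMPONENTS = ("T", "N", "M")


def tnm_accuracies(gold: List[Label], pred: List[Label]) -> Dict[str, float]:
    """Return per-component accuracy and joint accuracy for TNM predictions."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    scores = {
        k: sum(g[k] == p[k] for g, p in zip(gold, pred)) / n for k in COMPONENTS
    }
    # Joint accuracy: all three components must match for a report to count.
    scores["joint"] = sum(
        all(g[k] == p[k] for k in COMPONENTS) for g, p in zip(gold, pred)
    ) / n
    return scores


# Hypothetical gold labels and predictions for four reports.
gold = [
    {"T": "T1a", "N": "N0", "M": "M0"},
    {"T": "T2b", "N": "N1", "M": "M0"},
    {"T": "T3", "N": "N2", "M": "M1a"},
    {"T": "T4", "N": "N0", "M": "M0"},
]
pred = [
    {"T": "T1a", "N": "N0", "M": "M0"},
    {"T": "T2a", "N": "N1", "M": "M0"},  # T is wrong
    {"T": "T3", "N": "N2", "M": "M1a"},
    {"T": "T4", "N": "N1", "M": "M0"},  # N is wrong
]
result = tnm_accuracies(gold, pred)
print(result)
```

Because errors in different components fall on different reports, joint accuracy can sit well below every per-component accuracy, which matches the pattern in the figures quoted in the abstract.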

    Ubie at the NTCIR-18 RadNLP Main Task: Few-shot Classification of TNM Staging for Japanese Radiology Reports Using LLMs

    The Ubie team participated in the RadNLP core task on lung cancer staging classification based on Japanese radiology reports at NTCIR-18. This paper reports our approach and analyzes the official results. We investigated the impact of prompt engineering on TNM classification using large language models (LLMs). We compared multiple proprietary models available as of January 2025 (Gemini 1.5 Pro, Gemini Exp. 1206, and o1) using various prompt configurations, including zero-shot, few-shot, chain-of-thought (CoT), and self-feedbacked instruction. The results demonstrate significant performance improvements driven by model evolution in this medical text classification task. Analysis of prompt variations revealed that their impact depends on model capabilities. For the Gemini models tested, explicitly prompting reasoning steps (CoT) led to the most substantial performance gains. In contrast, the o1 model, a reasoning model that performs internal CoT and self-evaluation, showed limited benefit from explicit reasoning prompts, suggesting that strategies effective for non-reasoning models are less critical for advanced reasoning models. This finding is consistent with general guidance on prompting reasoning models and holds in our medical text classification experiments. The effectiveness of self-feedbacked instruction varied: it showed no improvement for Gemini 1.5 Pro, possibly due to inadequate feedback generation and its dependence on factors such as few-shot example selection. While prompt engineering offered limited gains for the reasoning model evaluated, it provided substantial performance benefits for non-reasoning models, highlighting its value for optimizing models without inherent advanced reasoning capabilities.
    conference paper

    Overview of the NTCIR-18 Transfer-2 Task

    This paper provides an overview of the NTCIR-18 Transfer-2 task, which aims to bring together researchers from Information Retrieval, Machine Learning, and Natural Language Processing to develop a suite of technologies for transferring resources generated for one purpose to another in the context of dense retrieval. Two subtasks were run in this round: the Retrieval Augmented Generation (RAG) subtask and the Dense Multimodal Retrieval (DMR) subtask. This paper presents the dataset developed and the evaluation results of participant runs. Note that this paper includes material from our earlier work published in [emtcir04], revised for the current work.
    conference paper

    KNUIR at the NTCIR-18 AEOLLM: Automatic Evaluation of LLMs

    In this study, we propose automated evaluation methods for LLMs that approximate human judgment by exploring and comparing two distinct approaches: (1) LLM-based scoring, which uses GPT models with prompt engineering, and (2) feature-based machine learning, which uses transformer-based metrics such as BERTScore, semantic similarity, and keyword coverage. As part of this research, we participated in the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) task. We submitted results for the test dataset and the reserved dataset to NTCIR-18 and analyzed the results obtained. The results show that GPT-4o Mini (with the updated prompt) achieved the highest performance, while the feature-based approach performed competitively, surpassing GPT-3.5 Turbo and trailing GPT-4o Mini by a small gap. LLM-based methods offered scalability but lacked explainability, whereas feature-based approaches provided better interpretability but required extensive tuning, highlighting the trade-offs between the two strategies. We expect that the findings of our work will provide insights into human judgment and the automated evaluation of LLMs.
    conference paper

    COPWA at the NTCIR-18 FairWeb-2 Task

    This paper describes our participation in the Conversational Search Subtask of the FairWeb-2 Task at NTCIR-18. Our system, COPWA, was designed to balance conversational relevance and group fairness while retrieving entities from researcher, movie, and YouTube content topics. We detail our approach, evaluation results, and analysis of our system’s performance using the GFRC (Group Fairness and Relevance of Conversations) framework.
    conference paper

    OURad at the NTCIR-18 RadNLP Task: Predicting Lung Cancer Clinical Staging from Radiology Reports Using Few-Shot Prompting of Large Language Models

    In this paper, we describe our proposed systems for the Japanese main task and subtask of the Natural Language Processing for Radiology 2024 shared task. We employed Generative Pre-trained Transformer models and applied a few-shot prompting approach to the classification of lung cancer TNM staging from free-text radiology reports. Our method first performs zero-shot prompting on the training data and then refines the final predictions by incorporating examples of incorrect predictions into the prompt. We demonstrate that this approach outperforms several BERT-based models and other open-source large language models. On the test data, our method achieved a Joint Accuracy (fine) of 0.732 for the main task and an overall micro F2.0 of 0.688 for the subtask, ranking 3rd in both categories.
    conference paper
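
For readers unfamiliar with the micro F2.0 metric mentioned above, the following is a minimal sketch of the standard micro-averaged F-beta score with beta = 2, which weights recall more heavily than precision. It assumes each report carries a set of labels; the label names and the function name are hypothetical, and the subtask's official scorer may differ in detail.

```python
from typing import List, Set


def micro_fbeta(gold: List[Set[str]], pred: List[Set[str]], beta: float = 2.0) -> float:
    """Micro-averaged F-beta: pool TP, FP, and FN over all reports, then score."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # labels predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # labels missed
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical example: one report with three gold labels, only one predicted.
gold = [{"label_a", "label_b", "label_c"}]
pred = [{"label_a"}]
f2 = micro_fbeta(gold, pred, beta=2.0)
f1 = micro_fbeta(gold, pred, beta=1.0)
print(round(f2, 4), round(f1, 4))
```

In this example precision is perfect but recall is low, so F2 comes out lower than F1, which illustrates how the metric penalizes missed labels more than spurious ones.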

    Summary of Proceedings of the 1st Research Data Infrastructure Steering Committee Meeting, FY2025 (Reiwa 7)

    conference output

    SPARC Japan Seminar 2024, "Beyond the Open Access Mandate: Toward the World to Come": Strengthening Research Capacity and Open Access in Japan (Document)

    SPARC Japan Seminar 2024, "Beyond the Open Access Mandate: Toward the World to Come". Venue: held online. Date: Thursday, January 30, 2025, 13:00-17:00.
    conference presentation

    SPARC Japan Seminar 2024, "Beyond the Open Access Mandate: Toward the World to Come": The History of Open Access in the Life Sciences (Document)

    SPARC Japan Seminar 2024, "Beyond the Open Access Mandate: Toward the World to Come". Venue: held online. Date: Thursday, January 30, 2025, 13:00-17:00.
    conference presentation

    2,022 full texts; 2,035 metadata records. Updated in the last 30 days.