Atuhurra Jesse

Ph.D. Candidate, Natural Language Processing (NLP)

Nara Institute of Science and Technology (NAIST)

Hi There 👋

I am a PhD student at the Natural Language Processing Lab of NAIST, working with Prof. Taro Watanabe. I am also grateful for the guidance of Hidetaka Kamigaito, Hiroyuki Shindo, and Hiroki Ouchi.

My NLP research interests lie in Information Extraction (named entity recognition, entity linking), Knowledge Graphs, Multimodal AI, prompting in Large Language Models (LLMs), and Low-resource NLP. Broadly speaking, I am passionate about applying deep learning approaches to enable machines to understand human language and to facilitate communication between humans and social robots.

I’m interning with the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), specifically at the DFKI Lab Berlin, where I construct language resources for Swahili and North European languages.

I’m working on social robots under the Guardian Robot Project at RIKEN, where I contribute to first-person multimodal perception through attribute collection and Vision-Language Models (VLMs). I work with Koichiro Yoshino.

I undertook a research internship at Fujitsu AI Lab, where I worked on Multimodal Information Extraction with Prof. Tomoya Iwakura and Tatsuya Hiraoka.

I was affiliated with HONDA Research Institute Japan (HRI-JP) as a Part-time Researcher, and my work primarily focused on Intent Recognition in Language for a Social Robot. I collaborated with Eric Nichols and Anton de la Fuente.

Previously, during my master’s degree, I worked on Intrusion Detection in IoT networks in the Large-scale Systems Management Lab of NAIST, advised by Prof. Shoji Kasahara.

I spent time as a Research Student at Kyoto University, in the Yoshikawa Lab of the Graduate School of Informatics. While there, I was supervised and mentored by Prof. Masatoshi Yoshikawa on methods mainly related to Information Retrieval, Databases, Human-Computer Interface design, and Artificial Intelligence.

My graduate studies are fully funded by the Japanese government’s MEXT scholarship, for which I am incredibly grateful.

Activities/News:
  • [July 2025] Started an R.A. position at Tokyo Tech (Sch. of Computing) 🇯🇵.
  • [June 2025] One paper accepted to IROS 2025 🇨🇳.
  • [Apr. 2025] Built Japanese Medical-insurance Knowledge Graph at InfoDeliver 🇯🇵.
  • [Dec. 2024] Commenced research internship at DFKI Berlin 🇩🇪.
  • [Jan. 2024] Commenced work on multimodal perception for Robots, at RIKEN 🇯🇵.
  • [Sept. 2023] Started research internship at Fujitsu AI Lab 🇯🇵.
  • [Oct. 2021] Completed research internship, starting a new role at Honda 🇯🇵.
  • [Sept. 2021] Selected to participate in the AllenNLP Hacks 2021 🇺🇸.

Interests
  • Natural Language Processing
  • Multimodal Foundation Models
  • Social Robotics
  • Human-Robot Interaction
  • Representation Learning
Education
  • PhD in Information Science and Engineering (expected), 2025

    Nara Institute of Science and Technology, Japan

  • MEng in Information Science and Engineering, 2022

    Nara Institute of Science and Technology, Japan

  • Research Student, 2020

    Kyoto University, Japan

  • BEng in Telecommunications Engineering, Jan 2016

    Kyambogo University, Uganda

Experience

Tokyo Tech
Research Assistant
Jul 2025 – Present Tokyo, Japan
Research: Large Reasoning Models and Long-Context LLMs for Robot Action Prediction.

German Research Center for Artificial Intelligence (DFKI)
Research Intern
Dec 2024 – Present Berlin, Germany
Research: Construction of language resources for Swahili and North European languages.

RIKEN (R-IH)
Research Assistant
Jan 2024 – Present Sorakugun, Kyoto, Japan
Research: Multimodal Perception in Social Robots.

Fujitsu AI Lab
Research Intern
Sep 2023 – Dec 2023 Kawasaki, Kanagawa, Japan
Research: Multimodal Information Extraction, Vision-Language Models (VLMs).

HONDA
Part-time Researcher
Nov 2021 – Mar 2023 Wako-shi, Saitama, Japan
Research: Named Entity Recognition, Knowledge Bases, Knowledge Graphs.

HONDA
Research Intern
Jul 2021 – Oct 2021 Wako-shi, Saitama, Japan
Research: Intent Recognition in Language for HARU.

NAIST
Graduate Student (MEng)
Apr 2020 – Mar 2022 Ikoma, Nara, Japan
I completed my Master’s degree in the Large-scale Systems Management Lab, where I worked on Intrusion Detection with Prof. Shoji Kasahara.

Gaba Corporation
English Language Instructor
Aug 2018 – Mar 2022 Kyoto/Osaka, Japan

Kyoto University
Research Student
Apr 2018 – Mar 2020 Kyoto, Japan
As a Research Student, I was mentored and supervised by Prof. Masatoshi Yoshikawa on methods in Information Retrieval, Databases, Human-Computer Interface design, and Artificial Intelligence.

United Nations Global Pulse Lab
Junior Researcher
Feb 2017 – Jul 2017 Kampala, Uganda
My work mainly included Big Data Analysis and the collection of GIS data.

Publications

Conferences & Preprints

Please find all my publications on Google Scholar.

Thesis

Dealing with Imbalanced Classes in Bot-IoT Dataset
Jesse Atuhurra
M.Eng Information Science and Engineering

Conferences

J-ORA: A Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception
Jesse Atuhurra, Hidetaka Kamigaito, Taro Watanabe, Koichiro Yoshino
IROS 2025. Hangzhou, Zhejiang, China. October 19-25, 2025. (to appear)
HLU: Human vs. LLM Generated Text Detection Dataset for Urdu at Multiple Granularities
Iqra Ali, Jesse Atuhurra, Hidetaka Kamigaito, Taro Watanabe
COLING 2025. Abu Dhabi, UAE. January 19–24, 2025.
Zero-shot Retrieval of User Intent in Human-Robot Interaction with Large Language Models
Jesse Atuhurra
IEEE MIPR 2024. San Jose, CA, USA. August 7-9, 2024.
The Impact of Large Language Models on Social Robots: Potential Benefits and Challenges
Jesse Atuhurra
Assistive Robots @ RSS 2024. Delft, Netherlands. July 15-19, 2024.

Preprints

NERsocial: Efficient Named Entity Recognition Dataset Construction for Human-Robot Interaction Utilizing RapidNER
Jesse Atuhurra, Hidetaka Kamigaito, Hiroki Ouchi, Hiroyuki Shindo, Taro Watanabe
arXiv:2412.09634
Leveraging Large Language Models in Human-Robot Interaction: A Critical Analysis of Potential and Pitfalls
Jesse Atuhurra
arXiv:2405.00693
Revealing Trends in Datasets from the 2022 ACL and EMNLP Conferences
Jesse Atuhurra, Hidetaka Kamigaito
arXiv:2404.08666
Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models
Jesse Atuhurra, Iqra Ali, Tatsuya Hiraoka, Hidetaka Kamigaito, Tomoya Iwakura, Taro Watanabe
arXiv:2406.15359
Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili
Jesse Atuhurra, Hiroyuki Shindo, Hidetaka Kamigaito, Taro Watanabe
arXiv:2406.15358
Domain Adaptation in Intent Classification Systems: A Review
Jesse Atuhurra, Hidetaka Kamigaito, Taro Watanabe, Eric Nichols
arXiv:2404.14415
Image Classification for CSSVD Detection in Cacao Plants
Jesse Atuhurra, N'guessan Yves-Roland Douha, Pabitra Lenka
arXiv:2405.04535
Enrich Robots with Updated Knowledge in the Wild via Large Language Models
Jesse Atuhurra
RG.2.2.15798.31048
Distilling Named Entity Recognition Models for Endangered Species from Large Language Models
Jesse Atuhurra, Seiveright Cargill Dujohn, Hidetaka Kamigaito, Hiroyuki Shindo, Taro Watanabe
arXiv:2403.15430

Projects

Descriptions & Datasets

Distillation of Bio-species Information from LLMs

This project focuses on creating datasets for Named Entity Recognition (NER) and Relation Extraction (RE) in the domain of endangered species by distilling knowledge from GPT-4. We generated synthetic data about four classes of endangered species (amphibians, arthropods, birds, and fishes) using GPT-4, which was then verified by humans using external knowledge bases like IUCN and Wikipedia. The final dataset contains 3.6K sentences evenly split between NER and RE tasks, with annotations for species, habitat, feeding, and breeding entities, along with their relationships. We fine-tuned various BERT models (standard BERT, BioBERT, and PubMedBERT) on this dataset, with PubMedBERT achieving the best performance at 94.14% F1-score. Lastly, we demonstrated that GPT-4 performs better than UniversalNER-7B in zero-shot NER tasks on both easy and hard examples, confirming GPT-4's effectiveness as a teacher model for knowledge distillation in this domain.
[PDF] [Code] [Data (HF)]
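To make the distillation pipeline concrete, here is a minimal sketch of the student-model step: fine-tuning PubMedBERT for token classification on the distilled sentences. The data file, label inventory, and hyperparameters below are illustrative placeholders, not the project's exact setup.

```python
# Hedged sketch: fine-tune a PubMedBERT student on distilled NER data.
# Assumes distilled_ner.jsonl (hypothetical file) holds records like
# {"tokens": [...], "ner_tags": [...]} with integer-encoded BIO tags.
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

LABELS = ["O", "B-SPECIES", "I-SPECIES", "B-HABITAT", "I-HABITAT",
          "B-FEEDING", "I-FEEDING", "B-BREEDING", "I-BREEDING"]
MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # HF model id

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL, num_labels=len(LABELS))

def tokenize_and_align(batch):
    # Re-align word-level BIO tags to wordpieces; special tokens get -100
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [
        [-100 if w is None else tags[w] for w in enc.word_ids(batch_index=i)]
        for i, tags in enumerate(batch["ner_tags"])
    ]
    return enc

ds = load_dataset("json", data_files="distilled_ner.jsonl", split="train")
ds = ds.map(tokenize_and_align, batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-student", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```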

Large-scale NER Dataset Construction

We introduce RapidNER, a framework for efficiently creating named entity recognition (NER) datasets for new domains, with a focus on human-robot interaction. The framework operates through three key steps: 1) extracting domain-specific knowledge from Wikidata using instance-of and subclass-of relations, 2) collecting diverse texts from Wikipedia, Reddit, and Stack Exchange, and 3) implementing an efficient annotation scheme using Elasticsearch. We demonstrate the framework by creating NERsocial, a new dataset containing 153K tokens, 134K entities, and 99.4K sentences across six entity types relevant for social interactions: drinks, foods, hobbies, jobs, pets, and sports. When fine-tuned on NERsocial, transformer models like BERT, RoBERTa, and DeBERTa-v3 achieve F1-scores above 95%. The framework significantly reduces dataset creation time and effort while maintaining high quality, as evidenced by a 90.6% inter-annotator agreement.
[PDF] [Code] [Data (HF)] [Website]
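As a rough illustration of step 1, the sketch below pulls English labels for instances (and subclass instances) of a target class from the public Wikidata SPARQL endpoint. The class QID and result limit are illustrative; RapidNER's actual queries and the Elasticsearch annotation step are more involved.

```python
# Hedged sketch of domain-entity extraction from Wikidata using
# instance-of (P31) and subclass-of (P279) relations.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def wikidata_labels(class_qid: str, limit: int = 500) -> list[str]:
    """Return English labels of items that are instances of class_qid
    or of any of its subclasses."""
    query = f"""
    SELECT DISTINCT ?itemLabel WHERE {{
      ?item wdt:P31/wdt:P279* wd:{class_qid} .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }} LIMIT {limit}
    """
    resp = requests.get(SPARQL_ENDPOINT,
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "rapidner-sketch/0.1"})
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return [r["itemLabel"]["value"] for r in rows]

# Q31629 is, to my knowledge, "type of sport" on Wikidata
sports = wikidata_labels("Q31629")
print(sports[:10])
```

The returned labels can then be indexed (e.g., in Elasticsearch) to match entity mentions in the collected Wikipedia, Reddit, and Stack Exchange texts.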

J-ORA: A Robot Perception Framework for Japanese

J-ORA is a novel benchmark and dataset designed to advance research at the intersection of robotics, vision, and language understanding in non-English settings, specifically Japanese. Developed through a collaboration between NAIST and the RIKEN Guardian Robot Project, J-ORA addresses key challenges in robot perception, including ambiguity in object reference, dynamic scene understanding, and multimodal instruction grounding. The benchmark provides a richly annotated multimodal dataset consisting of 142 real-world image-dialogue pairs, captured from a robot’s egocentric viewpoint. Each instance includes detailed object-attribute annotations, dialogue utterances in Japanese, bounding boxes, and grounded references, enabling evaluation across three core tasks: Object Identification, Reference Resolution, and Next Action Prediction. J-ORA further integrates real-world dynamics such as object occlusions, overlapping visual features, and temporal context to evaluate fine-grained multimodal reasoning. The dataset supports training and evaluation of Vision-Language Models (VLMs) under zero-shot and fine-tuned settings, and includes comparisons across proprietary models (e.g., GPT-4o, Gemini) and open-source Japanese VLMs. By addressing gaps in multilingual and grounded robotics datasets, J-ORA lays the foundation for building more perceptive, culturally adaptive, and interactive domestic service robots.
[Code] [Data (HF)] [Website]
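For a flavor of the zero-shot VLM setting, a hedged sketch follows: one egocentric image plus a Japanese dialogue snippet sent to GPT-4o for next-action prediction. The prompt wording and file name are mine, not J-ORA's released evaluation protocol; the call assumes the OpenAI Python SDK (v1) with an API key in the environment.

```python
# Hedged sketch: zero-shot next-action prediction with an image + dialogue.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def predict_next_action(image_path: str, dialogue: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("You are a domestic service robot. Given the scene "
                          "and the dialogue below (in Japanese), predict the "
                          "robot's next action.\n\n" + dialogue)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# Hypothetical example: image file and utterance are illustrative
print(predict_next_action("scene_042.jpg", "ユーザー: それをテーブルに置いて。"))
```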

Visual-Language (VLURes) Benchmark

VLURes is a comprehensive multilingual benchmark designed to evaluate and advance the capabilities of Vision-Language Models (VLMs) across diverse linguistic and cultural contexts. It introduces eight vision-and-language tasks spanning both image-only and image-text modalities: Scene Understanding, Relation Understanding, Semantic Segmentation, Image Captioning, Image-Text Matching, Visual Question Answering, and the novel tasks of Unrelatedness Detection and Multilingual Transfer. Unlike existing benchmarks that focus primarily on English (or occasionally Chinese), VLURes covers four languages: English (En), Japanese (Jp), Swahili (Sw), and Urdu (Ur), with a special emphasis on low-resource languages that are often underrepresented in AI research. The benchmark contains 1,000 high-quality image-text pairs per language, each embedded in article-level long-form text, allowing rigorous testing of discourse-level grounding and cross-modal reasoning. VLURes also evaluates zero-shot and one-shot generalization, with and without rationales, and introduces fine-tuning experiments to assess language transfer. Through extensive evaluation of state-of-the-art proprietary and open-source VLMs (e.g., GPT-4o, Gemini, Llava, Qwen2VL), VLURes reveals persistent performance gaps, especially in Swahili and Urdu, underscoring the urgent need for equitable, globally aware multimodal AI.
[PDF] [Code] [Data (HF)] [Website]
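To illustrate the zero-shot versus one-shot-with-rationale settings, here is a small sketch of prompt assembly; the template text and exemplar record are mine, not the benchmark's released prompts.

```python
# Hedged sketch: assembling zero-shot / one-shot prompts, optionally
# asking the model for a rationale before its answer.
def build_prompt(task: str, article: str, one_shot: bool = False,
                 with_rationale: bool = False,
                 exemplar: dict | None = None) -> str:
    parts = [f"Task: {task}.",
             "Read the article accompanying the image, then answer."]
    if one_shot and exemplar:
        parts.append(f"Example article: {exemplar['article']}")
        if with_rationale:
            parts.append(f"Example rationale: {exemplar['rationale']}")
        parts.append(f"Example answer: {exemplar['answer']}")
    parts.append(f"Article: {article}")
    if with_rationale:
        parts.append("First give your rationale, then your final answer.")
    return "\n\n".join(parts)

# Hypothetical usage: article and exemplar contents are placeholders
prompt = build_prompt(
    "Unrelatedness Detection",
    article="<article-level text paired with the image>",
    one_shot=True, with_rationale=True,
    exemplar={"article": "<example article>",
              "rationale": "<example rationale>",
              "answer": "unrelated"},
)
print(prompt)
```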

Zero-shot Intent Recognition

This research project investigates zero-shot user intent classification in human-robot interaction using large language models (LLMs). We created a new dataset containing 33,812 sentences across four languages (English, Japanese, Swahili, and Urdu) and six intent classes (pet, food, job, hobby, sport, and drink). We leveraged Wikidata knowledge graphs to extract sentences from Wikipedia articles and tested six different prompting methods with various LLMs including GPT-4, Claude 3, and Gemma. The experiments demonstrated that well-crafted prompts, utilizing advanced prompting methods, enabled LLMs to achieve high accuracy in intent classification without requiring fine-tuning or example data, with GPT-4 and Claude 3 achieving nearly 95% accuracy across all languages. The study also showed that retrieval-augmented generation (RAG) improved classification performance, and simple zero-shot prompting was sufficient for achieving competitive results, especially with more capable LLMs like GPT-4 and Claude 3 Opus.
[PDF] [Code] [Data (HF)]
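A minimal sketch of the zero-shot setting follows: classify one utterance into the six intents with a single prompt and no fine-tuning. The prompt wording is illustrative, and the call assumes the OpenAI Python SDK (v1) with an API key in the environment.

```python
# Hedged sketch: zero-shot intent classification with a single prompt.
from openai import OpenAI

INTENTS = ["pet", "food", "job", "hobby", "sport", "drink"]
client = OpenAI()  # assumes OPENAI_API_KEY is set

def classify_intent(utterance: str) -> str:
    prompt = (
        "Classify the user's sentence into exactly one intent from "
        f"{INTENTS}. Reply with the intent label only.\n\n"
        f"Sentence: {utterance}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(classify_intent("I take my golden retriever to the park every morning."))
```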

Contact

Email(s): atuhurra.jesse.ag2 [at] is.naist.jp OR atuhurrajesse [at] gmail.com