Atuhurra Jesse

Ph.D. Candidate, Natural Language Processing (NLP)

Nara Institute of Science and Technology (NAIST)

Hi There 👋

I am a PhD student at the Natural Language Processing Lab of NAIST, working with Prof. Taro Watanabe. I am also grateful for the guidance of Hidetaka Kamigaito, Hiroyuki Shindo, and Hiroki Ouchi.

My NLP research interests lie in Information Extraction (named entity recognition, entity linking), Knowledge Graphs, Multimodal AI, prompting in Large Language Models (LLMs), and Low-resource NLP. Broadly speaking, I am passionate about applying deep learning approaches to enable machines to understand human language and to facilitate communication between humans and social robots.

I’m interning with the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), specifically at the DFKI Lab Berlin, where I construct language resources for Swahili and North European languages.

I’m working on social robots under the Guardian Robot Project at RIKEN, where I contribute to first-person multimodal perception through attribute collection and Vision-Language Models (VLMs). I work with Koichiro Yoshino.

I undertook a research internship at Fujitsu AI Lab, where I worked on Multimodal Information Extraction with Prof. Tomoya Iwakura and Tatsuya Hiraoka.

I was affiliated with HONDA Research Institute Japan (HRI-JP) as a Part-time Researcher, and my work primarily focused on Intent Recognition in Language for a Social Robot. I collaborated with Eric Nichols and Anton de la Fuente.

Previously, during my master’s degree, I worked on Intrusion Detection in IoT networks in the Large-scale Systems Management Lab of NAIST, advised by Prof. Shoji Kasahara.

I spent time as a Research Student at Kyoto University, in the Yoshikawa Lab of the Graduate School of Informatics. While there, I was supervised and mentored by Prof. Masatoshi Yoshikawa on methods mainly related to Information Retrieval, Databases, Human-Computer Interface design, and Artificial Intelligence.

My graduate studies are fully funded by the Japanese government’s MEXT scholarship, for which I am incredibly grateful.

Activities/News:
  • [July 2025] Started an R.A. position at Tokyo Tech (Sch. of Computing) 🇯🇵.
  • [June 2025] One paper accepted to IROS 2025 🇨🇳.
  • [Apr. 2025] Built Japanese Medical-insurance Knowledge Graph at InfoDeliver 🇯🇵.
  • [Dec. 2024] Commenced research internship at DFKI Berlin 🇩🇪.
  • [Jan. 2024] Commenced work on multimodal perception for Robots, at RIKEN 🇯🇵.
  • [Sept. 2023] Started research internship at Fujitsu AI Lab 🇯🇵.
  • [Oct. 2021] Completed research internship, starting a new role at Honda 🇯🇵.
  • [Sept. 2021] Selected to participate in the AllenNLP Hacks 2021 🇺🇸.

Interests
  • Natural Language Processing
  • Multimodal Foundation Models
  • Social Robotics
  • Human-Robot Interaction
  • Representation Learning
Education
  • PhD in Information Science and Engineering (expected), 2025

    Nara Institute of Science and Technology, Japan

  • MEng in Information Science and Engineering, 2022

    Nara Institute of Science and Technology, Japan

  • Research Student, 2020

    Kyoto University, Japan

  • BEng in Telecommunications Engineering, Jan 2016

    Kyambogo University, Uganda

Experience

Tokyo Tech
Research Assistant
Jul 2025 – Present Tokyo, Japan
Research: Large Reasoning Models and Long-Context LLMs for Robot Action Prediction.

German Research Center for Artificial Intelligence (DFKI)
Research Intern
Dec 2024 – Present Berlin, Germany
Research: Construction of language resources for Swahili and North European languages.

RIKEN (R-IH)
Research Assistant
Jan 2024 – Present Sorakugun, Kyoto, Japan
Research: Multimodal Perception in Social Robots.

Fujitsu AI Lab
Research Intern
Sep 2023 – Dec 2023 Kawasaki, Kanagawa, Japan
Research: Multimodal Information Extraction, Vision-Language Models (VLMs).

HONDA
Part-time Researcher
Nov 2021 – Mar 2023 Wako-shi, Saitama, Japan
Research: Named Entity Recognition, Knowledge Bases, Knowledge Graphs.

HONDA
Research Intern
Jul 2021 – Oct 2021 Wako-shi, Saitama, Japan
Research: Intent Recognition in Language for HARU.

NAIST
Graduate Student (MEng)
Apr 2020 – Mar 2022 Ikoma, Nara, Japan
I completed my Master’s degree in the Large-scale Systems Management Lab, where I worked on Intrusion Detection with Prof. Shoji Kasahara.

Gaba Corporation
English Language Instructor
Aug 2018 – Mar 2022 Kyoto/Osaka, Japan

Kyoto University
Research Student
Apr 2018 – Mar 2020 Kyoto, Japan
As a Research Student, I was mentored and supervised by Prof. Masatoshi Yoshikawa on methods in Information Retrieval, Databases, Human-Computer Interface design, and Artificial Intelligence.

United Nations Global Pulse Lab
Junior Researcher
Feb 2017 – Jul 2017 Kampala, Uganda
My work mainly included Big Data Analysis and the collection of GIS data.

Publications

Conferences & Preprints

Please find all my publications on Google Scholar.

Thesis

Dealing with Imbalanced Classes in Bot-IoT Dataset
Jesse Atuhurra
M.Eng Information Science and Engineering

Conferences

J-ORA: A Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception
Jesse Atuhurra, Hidetaka Kamigaito, Taro Watanabe, Koichiro Yoshino
IROS 2025. Hangzhou, Zhejiang, China. October 19-25, 2025. (to appear)
HLU: Human vs. LLM Generated Text Detection Dataset for Urdu at Multiple Granularities
Iqra Ali, Jesse Atuhurra, Hidetaka Kamigaito, Taro Watanabe
COLING 2025. Abu Dhabi, UAE. January 19–24, 2025.
Zero-shot Retrieval of User Intent in Human-Robot Interaction with Large Language Models
Jesse Atuhurra
IEEE MIPR 2024. San Jose, CA, USA. August 7-9, 2024.
The Impact of Large Language Models on Social Robots: Potential Benefits and Challenges
Jesse Atuhurra
Assistive Robots @ RSS 2024. Delft, Netherlands. July 15-19, 2024.

Preprints

NERsocial: Efficient Named Entity Recognition Dataset Construction for Human-Robot Interaction Utilizing RapidNER
Jesse Atuhurra, Hidetaka Kamigaito, Hiroki Ouchi, Hiroyuki Shindo, Taro Watanabe
arXiv:2412.09634
Leveraging Large Language Models in Human-Robot Interaction: A Critical Analysis of Potential and Pitfalls
Jesse Atuhurra
arXiv:2405.00693
Revealing Trends in Datasets from the 2022 ACL and EMNLP Conferences
Jesse Atuhurra, Hidetaka Kamigaito
arXiv:2404.08666
Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models
Jesse Atuhurra, Iqra Ali, Tatsuya Hiraoka, Hidetaka Kamigaito, Tomoya Iwakura, Taro Watanabe
arXiv:2406.15359
Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili
Jesse Atuhurra, Hiroyuki Shindo, Hidetaka Kamigaito, Taro Watanabe
arXiv:2406.15358
Domain Adaptation in Intent Classification Systems: A Review
Jesse Atuhurra, Hidetaka Kamigaito, Taro Watanabe, Eric Nichols
arXiv:2404.14415
Image Classification for CSSVD Detection in Cacao Plants
Jesse Atuhurra, N'guessan Yves-Roland Douha, Pabitra Lenka
arXiv:2405.04535
Enrich Robots with Updated Knowledge in the Wild via Large Language Models
Jesse Atuhurra
RG.2.2.15798.31048
Distilling Named Entity Recognition Models for Endangered Species from Large Language Models
Jesse Atuhurra, Seiveright Cargill Dujohn, Hidetaka Kamigaito, Hiroyuki Shindo, Taro Watanabe
arXiv:2403.15430

Projects

Descriptions & Datasets

Distillation of Bio-species Information from LLMs

This project focuses on creating datasets for Named Entity Recognition (NER) and Relation Extraction (RE) in the domain of endangered species by distilling knowledge from GPT-4. We generated synthetic data about four classes of endangered species (amphibians, arthropods, birds, and fishes) using GPT-4, which was then verified by humans using external knowledge bases like IUCN and Wikipedia. The final dataset contains 3.6K sentences evenly split between NER and RE tasks, with annotations for species, habitat, feeding, and breeding entities, along with their relationships. We fine-tuned various BERT models (standard BERT, BioBERT, and PubMedBERT) on this dataset, with PubMedBERT achieving the best performance at 94.14% F1-score. Lastly, we demonstrated that GPT-4 performs better than UniversalNER-7B in zero-shot NER tasks on both easy and hard examples, confirming GPT-4's effectiveness as a teacher model for knowledge distillation in this domain.
[PDF] [Code] [Data (HF)]
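To make the distillation pipeline concrete, here is a minimal sketch of the student-model step: fine-tuning PubMedBERT for token classification on the distilled sentences. The data file, label inventory, and hyperparameters below are illustrative placeholders, not the project's exact setup.

```python
# Hedged sketch: fine-tune a PubMedBERT student on distilled NER data.
# Assumes distilled_ner.jsonl (hypothetical file) holds records like
# {"tokens": [...], "ner_tags": [...]} with integer-encoded BIO tags.
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

LABELS = ["O", "B-SPECIES", "I-SPECIES", "B-HABITAT", "I-HABITAT",
          "B-FEEDING", "I-FEEDING", "B-BREEDING", "I-BREEDING"]
MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # HF model id

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL, num_labels=len(LABELS))

def tokenize_and_align(batch):
    # Re-align word-level BIO tags to wordpieces; special tokens get -100
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [
        [-100 if w is None else tags[w] for w in enc.word_ids(batch_index=i)]
        for i, tags in enumerate(batch["ner_tags"])
    ]
    return enc

ds = load_dataset("json", data_files="distilled_ner.jsonl", split="train")
ds = ds.map(tokenize_and_align, batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-student", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```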

Large-scale NER Dataset Construction

We introduce RapidNER, a framework for efficiently creating named entity recognition (NER) datasets for new domains, with a focus on human-robot interaction. The framework operates through three key steps: 1) extracting domain-specific knowledge from Wikidata using instance-of and subclass-of relations, 2) collecting diverse texts from Wikipedia, Reddit, and Stack Exchange, and 3) implementing an efficient annotation scheme using Elasticsearch. We demonstrate the framework by creating NERsocial, a new dataset containing 153K tokens, 134K entities, and 99.4K sentences across six entity types relevant for social interactions: drinks, foods, hobbies, jobs, pets, and sports. When fine-tuned on NERsocial, transformer models like BERT, RoBERTa, and DeBERTa-v3 achieve F1-scores above 95%. The framework significantly reduces dataset creation time and effort while maintaining high quality, as evidenced by a 90.6% inter-annotator agreement.
[PDF] [Code] [Data (HF)] [Website]
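As a rough illustration of step 1, the sketch below pulls English labels for instances (and subclass instances) of a target class from the public Wikidata SPARQL endpoint. The class QID and result limit are illustrative; RapidNER's actual queries and the Elasticsearch annotation step are more involved.

```python
# Hedged sketch of domain-entity extraction from Wikidata using
# instance-of (P31) and subclass-of (P279) relations.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def wikidata_labels(class_qid: str, limit: int = 500) -> list[str]:
    """Return English labels of items that are instances of class_qid
    or of any of its subclasses."""
    query = f"""
    SELECT DISTINCT ?itemLabel WHERE {{
      ?item wdt:P31/wdt:P279* wd:{class_qid} .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }} LIMIT {limit}
    """
    resp = requests.get(SPARQL_ENDPOINT,
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "rapidner-sketch/0.1"})
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return [r["itemLabel"]["value"] for r in rows]

# Q31629 is, to my knowledge, "type of sport" on Wikidata
sports = wikidata_labels("Q31629")
print(sports[:10])
```

The returned labels can then be indexed (e.g., in Elasticsearch) to match entity mentions in the collected Wikipedia, Reddit, and Stack Exchange texts.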

J-ORA: A Robot Perception Framework for Japanese

J-ORA is a novel benchmark and dataset designed to advance research at the intersection of robotics, vision, and language understanding in non-English settings, specifically Japanese. Developed through a collaboration between NAIST and the RIKEN Guardian Robot Project, J-ORA addresses key challenges in robot perception, including ambiguity in object reference, dynamic scene understanding, and multimodal instruction grounding. The benchmark provides a richly annotated multimodal dataset consisting of 142 real-world image-dialogue pairs, captured from a robot’s egocentric viewpoint. Each instance includes detailed object-attribute annotations, dialogue utterances in Japanese, bounding boxes, and grounded references, enabling evaluation across three core tasks: Object Identification, Reference Resolution, and Next Action Prediction. J-ORA further integrates real-world dynamics such as object occlusions, overlapping visual features, and temporal context to evaluate fine-grained multimodal reasoning. The dataset supports training and evaluation of Vision-Language Models (VLMs) under zero-shot and fine-tuned settings, and includes comparisons across proprietary models (e.g., GPT-4o, Gemini) and open-source Japanese VLMs. By addressing gaps in multilingual and grounded robotics datasets, J-ORA lays the foundation for building more perceptive, culturally adaptive, and interactive domestic service robots.
[Code] [Data (HF)] [Website]
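For a flavor of the zero-shot VLM setting, a hedged sketch follows: one egocentric image plus a Japanese dialogue snippet sent to GPT-4o for next-action prediction. The prompt wording and file name are mine, not J-ORA's released evaluation protocol; the call assumes the OpenAI Python SDK (v1) with an API key in the environment.

```python
# Hedged sketch: zero-shot next-action prediction with an image + dialogue.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def predict_next_action(image_path: str, dialogue: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("You are a domestic service robot. Given the scene "
                          "and the dialogue below (in Japanese), predict the "
                          "robot's next action.\n\n" + dialogue)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# Hypothetical example: image file and utterance are illustrative
print(predict_next_action("scene_042.jpg", "ユーザー: それをテーブルに置いて。"))
```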

Visual-Language (VLURes) Benchmark

VLURes is a comprehensive multilingual benchmark designed to evaluate and advance the capabilities of Vision-Language Models (VLMs) across diverse linguistic and cultural contexts. It introduces eight vision-and-language tasks spanning both image-only and image-text modalities: Scene Understanding, Relation Understanding, Semantic Segmentation, Image Captioning, Image-Text Matching, Visual Question Answering, and the novel tasks of Unrelatedness Detection and Multilingual Transfer. Unlike existing benchmarks that focus primarily on English (or occasionally Chinese), VLURes covers four languages: English (En), Japanese (Jp), Swahili (Sw), and Urdu (Ur), with a special emphasis on low-resource languages that are often underrepresented in AI research. The benchmark contains 1,000 high-quality image-text pairs per language, each embedded in article-level long-form text, allowing rigorous testing of discourse-level grounding and cross-modal reasoning. VLURes also evaluates zero-shot and one-shot generalization, with and without rationales, and introduces fine-tuning experiments to assess language transfer. Through extensive evaluation of state-of-the-art proprietary and open-source VLMs (e.g., GPT-4o, Gemini, Llava, Qwen2VL), VLURes reveals persistent performance gaps, especially in Swahili and Urdu, underscoring the urgent need for equitable, globally aware multimodal AI.
[PDF] [Code] [Data (HF)] [Website]
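To illustrate the zero-shot versus one-shot-with-rationale settings, here is a small sketch of prompt assembly; the template text and exemplar record are mine, not the benchmark's released prompts.

```python
# Hedged sketch: assembling zero-shot / one-shot prompts, optionally
# asking the model for a rationale before its answer.
def build_prompt(task: str, article: str, one_shot: bool = False,
                 with_rationale: bool = False,
                 exemplar: dict | None = None) -> str:
    parts = [f"Task: {task}.",
             "Read the article accompanying the image, then answer."]
    if one_shot and exemplar:
        parts.append(f"Example article: {exemplar['article']}")
        if with_rationale:
            parts.append(f"Example rationale: {exemplar['rationale']}")
        parts.append(f"Example answer: {exemplar['answer']}")
    parts.append(f"Article: {article}")
    if with_rationale:
        parts.append("First give your rationale, then your final answer.")
    return "\n\n".join(parts)

# Hypothetical usage: article and exemplar contents are placeholders
prompt = build_prompt(
    "Unrelatedness Detection",
    article="<article-level text paired with the image>",
    one_shot=True, with_rationale=True,
    exemplar={"article": "<example article>",
              "rationale": "<example rationale>",
              "answer": "unrelated"},
)
print(prompt)
```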

Zero-shot Intent Recognition

This research project investigates zero-shot user intent classification in human-robot interaction using large language models (LLMs). We created a new dataset containing 33,812 sentences across four languages (English, Japanese, Swahili, and Urdu) and six intent classes (pet, food, job, hobby, sport, and drink). We leveraged Wikidata knowledge graphs to extract sentences from Wikipedia articles and tested six different prompting methods with various LLMs including GPT-4, Claude 3, and Gemma. The experiments demonstrated that well-crafted prompts, utilizing advanced prompting methods, enabled LLMs to achieve high accuracy in intent classification without requiring fine-tuning or example data, with GPT-4 and Claude 3 achieving nearly 95% accuracy across all languages. The study also showed that retrieval-augmented generation (RAG) improved classification performance, and simple zero-shot prompting was sufficient for achieving competitive results, especially with more capable LLMs like GPT-4 and Claude 3 Opus.
[PDF] [Code] [Data (HF)]
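A minimal sketch of the zero-shot setting follows: classify one utterance into the six intents with a single prompt and no fine-tuning. The prompt wording is illustrative, and the call assumes the OpenAI Python SDK (v1) with an API key in the environment.

```python
# Hedged sketch: zero-shot intent classification with a single prompt.
from openai import OpenAI

INTENTS = ["pet", "food", "job", "hobby", "sport", "drink"]
client = OpenAI()  # assumes OPENAI_API_KEY is set

def classify_intent(utterance: str) -> str:
    prompt = (
        "Classify the user's sentence into exactly one intent from "
        f"{INTENTS}. Reply with the intent label only.\n\n"
        f"Sentence: {utterance}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(classify_intent("I take my golden retriever to the park every morning."))
```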

Contact

Email(s): atuhurra.jesse.ag2 [at] is.naist.jp OR atuhurrajesse [at] gmail.com