
SciEGQA

A Dataset for Scientific Evidence-Grounded Question Answering and Reasoning

Wenhan Yu · Zhaoxi Zhang · Wang Chen · Guanqiang Qi · Weikang Li · Lei Sha · Deguo Xia · Jizhou Huang

SciEGQA targets a practical but under-explored setting: answering questions on scientific papers while explicitly grounding each answer in the visual evidence regions that support it. Each sample pairs the query and answer with evidence pages, semantic-region bounding boxes, document metadata, and multiple converted views for answer generation and grounding evaluation.

Semantic Region Grounding · Scientific DocVQA · Benchmark + Training Set · Localization + Answer Evaluation

  • Benchmark QA pairs: 1,623
  • Training Set QA pairs: 30,780
  • arXiv categories: 8
  • Question types: 3

Paper Preview

SciEGQA paper preview

  • Benchmark: 80 papers, 1,941 pages
  • Training Set: 3,671 papers, 42,380 pages
  • Annotation: 12 annotators, 24 domain experts
  • Evidence: evidence document and page(s), evidence region(s)

Overview

Page-level QA with semantic-region evidence

Existing document QA datasets usually fall into one of three buckets: page-level QA without precise grounding, component-focused QA on isolated charts or tables, or token-level supervision that is precise but often too fragmented to preserve semantic completeness. SciEGQA is designed to fill the gap between these extremes.

The dataset keeps the question answering setup close to realistic scientific document understanding while attaching semantic evidence regions as bounding boxes. This makes it possible to evaluate both whether a model answers correctly and whether it actually found the right visual support.

Evidence Grounding · Multi-View Data · Evaluation Metrics

Comparison with other DocVQA datasets

SciEGQA is positioned between coarse page-level QA and overly fragmented token-level grounding by annotating semantically complete evidence regions.

At A Glance

Two complementary dataset components

Human annotated
SciEGQA Benchmark

A fine-grained benchmark built from 80 arXiv papers with manual semantic-region annotations for rigorous evaluation of scientific document understanding and evidence grounding.

  • 1,623 QA samples
  • 1,941 pages
  • 80 papers
  • Fine-grained human annotation

Automatically generated
SciEGQA Training Set

A large-scale training resource constructed through a PDF-to-region-to-QA pipeline, designed for scalable model training and extensible dataset generation.

  • 30,780 QA samples
  • 42,380 pages
  • 3,671 arXiv papers
  • Supports automatic data expansion

Data format

Overview of the raw data format for each QA sample.

{
  "query": "...",
  "answer": "...",
  "evidence_page": [x],
  "bbox": [[[x1, y1, x2, y2]]],
  "rel_bbox": [[[rel_x1, rel_y1, rel_x2, rel_y2]]],
  "subimg_type": [["category"]],
  "doc_name": "xxxxxx",
  "category": "xxx"
}
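
A minimal sketch of reading this format in Python, assuming samples are stored one JSON object per line in a file such as sciegqa_train.jsonl (the file name and JSONL layout are assumptions; only the field names come from the schema above):

import json

def load_samples(path):
    """Yield SciEGQA samples from a JSON-lines file (assumed layout)."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def iter_evidence_regions(sample):
    """Pair every evidence page with its boxes and region categories."""
    for page, boxes, types in zip(sample["evidence_page"],
                                  sample["bbox"],
                                  sample["subimg_type"]):
        for box, region_type in zip(boxes, types):
            yield page, tuple(box), region_type

for sample in load_samples("sciegqa_train.jsonl"):
    print(sample["query"], "->", sample["answer"])
    for page, box, region_type in iter_evidence_regions(sample):
        print(f"  page {page}: {region_type} region at {box}")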

Construction

From PDF pages to evidence-grounded QA

SciEGQA construction pipeline

The SciEGQA dataset consists of a fine-grained benchmark and a large-scale automatically constructed training set.

Human annotated
SciEGQA Benchmark

1. Region Annotation
2. Cross Validation
3. Expert QA Design
4. Quality Assurance

Automatically generated
SciEGQA Training Set

1. Region Segmentation
2. Region Filtering
3. QA Generation
4. Answer Verification

Examples

Three types of Questions

SPSR

Single Page Single Region

Localized reasoning from one semantic evidence region on one page.

Benchmark: 46.15% · Training: 37.91%

SPMR

Single Page Multi Regions

Reasoning across multiple semantically related regions within the same page.

Benchmark: 34.26% · Training: 24.41%

MPMR

Multi Pages Multi Regions

Cross-page evidence integration over multiple regions; the most complex setting.

Benchmark: 19.59% · Training: 37.69%

Table example for rational entry
Single Page Single Region

In the identities table, what AI entry corresponds to the DS term "rational"?

Answer: machine

Evidence Page: [26]

Evidence Regions: [[[305, 1512, 1520, 2005]]]

Evidence Categories: [Table]

SPMR example with image and text evidence
Single Page Multi Regions

Using the axis descriptions and the wheel diagram, which segment labels mark the ends of the vertical "how vs why" axis and the horizontal "supporting practices" axis?

Answer: Vertical: Value and Analytics; Horizontal: Systems and Design.

Evidence Page: [11]

Evidence Regions: [[[637, 302, 1887, 1614], [293, 1661, 2216, 1999]]]

Evidence Categories: [Image, Text]

MPMR table evidence on page 9
MPMR figure evidence on page 5
Multi Pages Multi Regions

In Table 1, the Machine-Concrete quadrant shows which primed numeral, and in Figure 2 what three left-column actions correspond to that numeral?

Answer: III'; clean, prepare, explore

Evidence Pages: [9, 5]

Evidence Regions: [[[719, 1244, 1796, 1770]], [[593, 783, 1943, 1415]]]

Evidence Categories: [Table, Image]
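
To eyeball annotations like the ones above, a small helper can draw the evidence boxes onto a rendered page image; the page_26.png path is an assumption, and the boxes are taken to be in the page's pixel coordinates, matching the absolute regions shown in these examples:

from PIL import Image, ImageDraw

def draw_evidence(page_image_path, boxes, labels=None, outline="red", width=5):
    """Return a copy of the page image with evidence boxes (and optional labels) drawn on it."""
    page = Image.open(page_image_path).convert("RGB")
    draw = ImageDraw.Draw(page)
    for i, box in enumerate(boxes):
        draw.rectangle(list(box), outline=outline, width=width)
        if labels:
            draw.text((box[0] + 5, box[1] + 5), labels[i], fill=outline)
    return page

# SPSR example above: one Table region on evidence page 26.
annotated = draw_evidence("page_26.png", [(305, 1512, 1520, 2005)], labels=["Table"])
annotated.save("page_26_evidence.png")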

Statistics

Coverage, Diversity, and Evaluation signals

arXiv category distribution

Benchmark papers are evenly sampled with 10 papers per category. The training set scales unevenly across eight domains.

  • q-fin: 785 train / 10 bench
  • q-bio: 670 train / 10 bench
  • econ: 576 train / 10 bench
  • cs: 410 train / 10 bench
  • stat: 353 train / 10 bench
  • physics: 337 train / 10 bench
  • eess: 288 train / 10 bench
  • math: 252 train / 10 bench
SciEGQA statistical analysis

The paper reports dataset statistics over category-modality combinations, question and answer lengths, spatial evidence locations, and evidence type distributions for both benchmark and training data.

Task

Definitions of Two Evaluation Tasks

We design two evaluation tasks on the SciEGQA benchmark and perform zero-shot inference to systematically evaluate a diverse set of Vision-Language Models (VLMs).

Grounding-Crop-then-Answer

Given the evidence page(s) and the corresponding question, the model predicts a set of bounding boxes that localize the regions relevant to answering the question. Each box is represented as (x1, y1, x2, y2), with coordinates normalized to a [0, 1000] page coordinate space.

Metrics: IoU and answer accuracy.

Evaluation pipeline

  • 1. Grounding: the grounding model predicts one or more answer-relevant bounding boxes.
  • 2. Crop: crop the page image to obtain the corresponding predicted regions.
  • 3. Answer: use Qwen3.5-27B as the fixed generation model to produce the final answer.
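
For concreteness, a minimal sketch of the crop step in this pipeline, assuming predicted boxes live in the [0, 1000] normalized page space and that page images are handled with Pillow; answer_with_vlm and the file names are placeholders, not part of any released evaluation code:

from PIL import Image

def denormalize_box(box, page_width, page_height, scale=1000):
    """Map an (x1, y1, x2, y2) box from [0, scale] space to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (int(x1 / scale * page_width), int(y1 / scale * page_height),
            int(x2 / scale * page_width), int(y2 / scale * page_height))

def crop_predicted_regions(page_image_path, predicted_boxes):
    """Crop every predicted evidence region from one rendered page image."""
    page = Image.open(page_image_path)
    width, height = page.size
    return [page.crop(denormalize_box(box, width, height)) for box in predicted_boxes]

def answer_with_vlm(question, crops):
    """Placeholder for the fixed answer-generation model (hypothetical)."""
    raise NotImplementedError("plug in the generation model used in the Answer step")

# Illustrative predicted box in the [0, 1000] space, not a real model output.
crops = crop_predicted_regions("page_11.png", [(120, 300, 880, 620)])
# answer = answer_with_vlm("Which segment labels mark the ends of the axes?", crops)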

QA under Different Evidence Granularities

We evaluate model performance under three levels of input granularity: the entire document, the evidence page(s), and the cropped evidence region(s). By comparing answer accuracy across these settings, we analyze how evidence granularity influences the question answering capability of VLMs.

Input 1
Entire document input example

Entire document

Whole document images

All pages of the source paper are provided to the model for direct answer generation.

Input 2
Evidence page input example

Evidence page(s)

Answer from evidence pages

Only the page or pages that support the answer evidence are given as input.

Input 3
Cropped evidence region input example

Cropped evidence region(s)

Answer from evidence regions

The model receives only the cropped evidence region or regions annotated for the QA pair.
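
As a rough sketch of how the three granularities could be assembled from one annotated sample: the per-paper directory of rendered pages and the page_{n}.png naming are assumptions, and bbox is taken to hold pixel coordinates in the rendered page (rel_bbox would be used instead when working with relative coordinates):

from pathlib import Path
from PIL import Image

def page_path(doc_dir, page_idx):
    # Assumed naming scheme for rendered page images.
    return Path(doc_dir) / f"page_{page_idx}.png"

def build_inputs(sample, doc_dir):
    """Return (whole_document, evidence_pages, evidence_crops) as lists of images."""
    all_pages = sorted(Path(doc_dir).glob("page_*.png"),
                       key=lambda p: int(p.stem.split("_")[1]))
    whole_document = [Image.open(p) for p in all_pages]

    evidence_pages = [Image.open(page_path(doc_dir, idx))
                      for idx in sample["evidence_page"]]

    evidence_crops = []
    for idx, boxes in zip(sample["evidence_page"], sample["bbox"]):
        page = Image.open(page_path(doc_dir, idx))
        evidence_crops.extend(page.crop(tuple(box)) for box in boxes)

    return whole_document, evidence_pages, evidence_crops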

Main Results

Top-line benchmark performance

The tables below summarize the main benchmark numbers. Task 1 evaluates both grounding quality and crop-then-answer performance, while Task 2 measures answer accuracy under different evidence granularities.

  • Task 1 · Mean IoU: Qwen3.5-397B-A17B, 38.89% (best average overlap with gold evidence regions)
  • Task 1 · IoU@0.3: Qwen3-VL-8B-SFT, 57.92% (highest proportion of predictions with IoU greater than 0.3)
  • Task 1 · Final accuracy: Qwen3.5-27B, 51.26% (best crop-then-answer accuracy in the grounding pipeline)
  • Task 2 · Whole document: Kimi-K2.5, 68.02% (best answer accuracy with full-paper input)
  • Task 2 · Evidence page(s): GPT-5.2, 81.39% (best answer accuracy when only evidence pages are given)
  • Task 2 · Evidence crop(s): GPT-5.2, 85.15% (best answer accuracy on cropped evidence regions)

Task 1

Grounding-Crop-then-Answer

Metrics: valid grounding output, IoU, and final answer accuracy.
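
For reference, a minimal sketch of box IoU and the thresholded IoU@0.3/0.5/0.7 columns below, assuming one predicted and one gold box per comparison (the exact matching rule for multi-box samples is not restated here):

def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_at(ious, threshold):
    """Fraction of samples whose IoU is greater than a threshold (e.g. 0.3, 0.5, 0.7)."""
    return sum(i > threshold for i in ious) / len(ious)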

Model | Valid output | Valid ratio | Mean IoU | IoU@0.3 | IoU@0.5 | IoU@0.7 | Acc
Qwen3-VL-8B | 1413 | 87.06% | 21.43% | 24.77% | 12.14% | 7.27% | 32.04%
Qwen3-VL-32B | 1593 | 98.15% | 31.53% | 44.55% | 18.55% | 11.95% | 46.46%
Qwen3-VL-235B-A22B | 1439 | 88.66% | 25.22% | 33.52% | 10.72% | 4.87% | 38.63%
Qwen3.5-27B | 1620 | 99.82% | 36.04% | 47.50% | 29.39% | 19.90% | 51.26%
Qwen3.5-122B-A10B | 1501 | 92.48% | 30.76% | 38.08% | 21.87% | 14.11% | 46.95%
Qwen3.5-397B-A17B | 1615 | 99.51% | 38.89% | 52.19% | 32.96% | 20.64% | 50.15%
InternVL3-38B | 1445 | 89.03% | 5.23% | 3.39% | 0.62% | 0.12% | 5.05%
Ernie-5 | 1378 | 84.90% | 14.57% | 19.10% | 9.55% | 4.00% | 22.12%
Kimi-K2.5 | 1563 | 96.30% | 7.17% | 7.16% | 5.49% | 2.23% | 7.46%
Claude Sonnet 4.6 | 1523 | 93.84% | 15.93% | 17.25% | 4.81% | 0.68% | 19.29%
GPT-5.2 | 1621 | 99.88% | 31.63% | 49.11% | 18.55% | 4.81% | 41.22%
Qwen3-VL-8B-SFT | 1619 | 99.75% | 38.66% | 57.92% | 28.77% | 17.25% | 49.97%

Task 2

Evidence-Granularity-QA

Metric: overall answer accuracy on benchmark QA pairs.

Model | Whole document | Evidence page(s) | Evidence crop(s)
Qwen3-VL-8B | 9.30% | 60.51% | 65.87%
Qwen3-VL-32B | 16.14% | 73.14% | 75.54%
Qwen3-VL-235B-A22B | 40.30% | 70.49% | 75.42%
Qwen3.5-27B | 42.95% | 69.69% | 77.02%
Qwen3.5-122B-A10B | 37.58% | 68.95% | 73.01%
Qwen3.5-397B-A17B | 36.23% | 68.08% | 73.75%
InternVL3-38B | 10.23% | 48.18% | 64.14%
Ernie-5 | 38.51% | 77.39% | 81.89%
Kimi-K2.5 | 68.02% | 79.98% | 81.21%
Claude Sonnet 4.6 | 43.99% | 68.45% | 74.31%
GPT-5.2 | 65.43% | 81.39% | 85.15%
Qwen3-VL-8B-SFT | 32.87% | 69.43% | 76.89%

Analysis Figures

What the raw result tables reveal

Task 1 grounding quality and answer accuracy analysis

In Task 1, grounding quality and final answer accuracy are tightly coupled, since the predicted bounding boxes determine the visual content provided to the downstream QA model. When the localized regions deviate from the true evidence, the cropped inputs may contain incomplete or irrelevant information, which easily leads to incorrect answers. Meanwhile, achieving high-IoU localization remains substantially harder than coarse localization, suggesting that current models still struggle to precisely capture semantically complete evidence regions in complex scientific documents.

Task 2 accuracy comparison across evidence granularities
Task 2 average accuracy and question-type analysis

In Task 2, overall QA accuracy improves as evidence becomes more focused, from document-level to page-level and finally region-level inputs. This trend indicates that reducing irrelevant context helps models concentrate on the most informative regions for reasoning. Further analysis shows that SPSR, SPMR, and MPMR form distinct difficulty bands rather than behaving like a single homogeneous QA set, with performance gradually decreasing as reasoning complexity increases.

Paper

Contributions

The paper positions SciEGQA as filling a missing middle ground in DocVQA: it preserves full-page scientific document reasoning while adding semantic-region evidence boxes that are complete enough to remain meaningful and precise enough to evaluate grounding. It also formalizes benchmark and training-set construction as two coordinated parts of one dataset ecosystem.

Task formulation

Question answering on scientific documents with explicit evidence pages and semantic-region bounding boxes.

Dataset scale

1,623 benchmark QA pairs plus 30,780 automatically generated training QA pairs from 3,671 papers.

Reasoning spectrum

SPSR, SPMR, and MPMR tasks cover localized, multi-region, and cross-page evidence integration.

Evaluation signals

Answer correctness and grounding quality can be evaluated together, rather than treating them as separate tasks.

Resources

Download, browse, and build on SciEGQA

Citation

BibTeX

If you use SciEGQA in your research, please cite the arXiv paper below.

@misc{yu2026sciegqadatasetscientificevidencegrounded,
      title={SciEGQA: A Dataset for Scientific Evidence-Grounded Question Answering and Reasoning}, 
      author={Wenhan Yu and Zhaoxi Zhang and Wang Chen and Guanqiang Qi and Weikang Li and Lei Sha and Deguo Xia and Jizhou Huang},
      year={2026},
      eprint={2511.15090},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2511.15090}, 
}