
SciEGQA

A Dataset for Scientific Evidence-Grounded Question Answering and Reasoning

Wenhan Yu · Zhaoxi Zhang · Wang Chen · Guanqiang Qi · Weikang Li · Lei Sha · Deguo Xia · Jizhou Huang

SciEGQA targets a practical but under-explored setting: answering questions on scientific papers while explicitly grounding each answer in the visual evidence regions that support it. Each sample pairs the query and answer with evidence pages, semantic-region bounding boxes, document metadata, and multiple converted views for answer generation and grounding evaluation.

Semantic Region Grounding · Scientific DocVQA · Benchmark + Training Set · Localization + Answer Evaluation

  • Benchmark QA pairs: 1,623
  • Training Set QA pairs: 30,780
  • arXiv categories: 8
  • Question types: 3

Paper Preview

SciEGQA paper preview

  • Benchmark: 80 papers, 1,941 pages
  • Training Set: 3,671 papers, 42,380 pages
  • Annotation: 12 annotators, 24 domain experts
  • Evidence: evidence document and page(s), evidence region(s)

Overview

Page-level QA with semantic-region evidence

Existing document QA datasets usually fall into one of three buckets: page-level QA without precise grounding, component-focused QA on isolated charts or tables, or token-level supervision that is precise but often too fragmented to preserve semantic completeness. SciEGQA is designed to fill the gap between these extremes.

The dataset keeps the question answering setup close to realistic scientific document understanding while attaching semantic evidence regions as bounding boxes. This makes it possible to evaluate both whether a model answers correctly and whether it actually found the right visual support.

Evidence Grounding · Multi-View Data · Evaluation Metrics

Comparison with other DocVQA datasets

SciEGQA is positioned between coarse page-level QA and overly fragmented token-level grounding by annotating semantically complete evidence regions.

At A Glance

Two complementary dataset components

Human annotated
SciEGQA Benchmark

A fine-grained benchmark built from 80 arXiv papers with manual semantic-region annotations for rigorous evaluation of scientific document understanding and evidence grounding.

  • 1,623 QA samples
  • 1,941 pages
  • 80 papers
  • Fine-grained human annotation

Automatically generated
SciEGQA Training Set

A large-scale training resource constructed through a PDF-to-region-to-QA pipeline, designed for scalable model training and extensible dataset generation.

  • 30,780 QA samples
  • 42,380 pages
  • 3,671 arXiv papers
  • Supports automatic data expansion

Data format

Overview of the raw data format for each QA sample.

{
  "query": "...",
  "answer": "...",
  "evidence_page": [x],
  "bbox": [[[x1, y1, x2, y2]]],
  "rel_bbox": [[[rel_x1, rel_y1, rel_x2, rel_y2]]],
  "subimg_type": [["category"]],
  "doc_name": "xxxxxx",
  "category": "xxx"
}
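
A minimal sketch of reading this format in Python, assuming samples are stored one JSON object per line in a file such as sciegqa_train.jsonl (the file name and JSONL layout are assumptions; only the field names come from the schema above):

import json

def load_samples(path):
    """Yield SciEGQA samples from a JSON-lines file (assumed layout)."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def iter_evidence_regions(sample):
    """Pair every evidence page with its boxes and region categories."""
    for page, boxes, types in zip(sample["evidence_page"],
                                  sample["bbox"],
                                  sample["subimg_type"]):
        for box, region_type in zip(boxes, types):
            yield page, tuple(box), region_type

for sample in load_samples("sciegqa_train.jsonl"):
    print(sample["query"], "->", sample["answer"])
    for page, box, region_type in iter_evidence_regions(sample):
        print(f"  page {page}: {region_type} region at {box}")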

Construction

From PDF pages to evidence-grounded QA

SciEGQA construction pipeline

The SciEGQA dataset consists of a fine-grained benchmark and a large-scale automatically constructed training set.

Human annotated
SciEGQA Benchmark

1. Region Annotation
2. Cross Validation
3. Expert QA Design
4. Quality Assurance

Automatically generated
SciEGQA Training Set

1. Region Segmentation
2. Region Filtering
3. QA Generation
4. Answer Verification

Examples

Three types of Questions

SPSR

Single Page Single Region

Localized reasoning from one semantic evidence region on one page.

Benchmark: 46.15% · Training: 37.91%

SPMR

Single Page Multi Regions

Reasoning across multiple semantically related regions within the same page.

Benchmark: 34.26% · Training: 24.41%

MPMR

Multi Pages Multi Regions

Cross-page evidence integration over multiple regions; the most complex setting.

Benchmark: 19.59% · Training: 37.69%

Table example for rational entry
Single Page Single Region

In the identities table, what AI entry corresponds to the DS term "rational"?

Answer: machine

Evidence Page: [26]

Evidence Regions: [[[305, 1512, 1520, 2005]]]

Evidence Categories: [Table]

SPMR example with image and text evidence
Single Page Multi Regions

Using the axis descriptions and the wheel diagram, which segment labels mark the ends of the vertical "how vs why" axis and the horizontal "supporting practices" axis?

Answer: Vertical: Value and Analytics; Horizontal: Systems and Design.

Evidence Page: [11]

Evidence Regions: [[[637, 302, 1887, 1614], [293, 1661, 2216, 1999]]]

Evidence Categories: [Image, Text]

MPMR table evidence on page 9
MPMR figure evidence on page 5
Multi Pages Multi Regions

In Table 1, the Machine-Concrete quadrant shows which primed numeral, and in Figure 2 what three left-column actions correspond to that numeral?

Answer: III'; clean, prepare, explore

Evidence Pages: [9, 5]

Evidence Regions: [[[719, 1244, 1796, 1770]], [[593, 783, 1943, 1415]]]

Evidence Categories: [Table, Image]
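
To eyeball annotations like the ones above, a small helper can draw the evidence boxes onto a rendered page image; the page_26.png path is an assumption, and the boxes are taken to be in the page's pixel coordinates, matching the absolute regions shown in these examples:

from PIL import Image, ImageDraw

def draw_evidence(page_image_path, boxes, labels=None, outline="red", width=5):
    """Return a copy of the page image with evidence boxes (and optional labels) drawn on it."""
    page = Image.open(page_image_path).convert("RGB")
    draw = ImageDraw.Draw(page)
    for i, box in enumerate(boxes):
        draw.rectangle(list(box), outline=outline, width=width)
        if labels:
            draw.text((box[0] + 5, box[1] + 5), labels[i], fill=outline)
    return page

# SPSR example above: one Table region on evidence page 26.
annotated = draw_evidence("page_26.png", [(305, 1512, 1520, 2005)], labels=["Table"])
annotated.save("page_26_evidence.png")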

Statistics

Coverage, Diversity, and Evaluation signals

arXiv category distribution

Benchmark papers are evenly sampled with 10 papers per category. The training set scales unevenly across eight domains.

  • q-fin: 785 train / 10 bench
  • q-bio: 670 train / 10 bench
  • econ: 576 train / 10 bench
  • cs: 410 train / 10 bench
  • stat: 353 train / 10 bench
  • physics: 337 train / 10 bench
  • eess: 288 train / 10 bench
  • math: 252 train / 10 bench
SciEGQA statistical analysis

The paper reports dataset statistics over category-modality combinations, question and answer lengths, spatial evidence locations, and evidence type distributions for both benchmark and training data.

Task

Definitions of Two Evaluation Tasks

We design two evaluation tasks on the SciEGQA benchmark and perform zero-shot inference to systematically evaluate a diverse set of Vision-Language Models (VLMs).

Grounding-Crop-then-Answer

Given the evidence page(s) and the corresponding question, the model predicts a set of bounding boxes that localize the regions relevant to answering the question. Each box is represented as (x1, y1, x2, y2), with coordinates normalized to a [0, 1000] page coordinate space.

Metrics: IoU and answer accuracy.

Evaluation pipeline

  • 1. Grounding: the grounding model predicts one or more answer-relevant bounding boxes.
  • 2. Crop: crop the page image to obtain the corresponding predicted regions.
  • 3. Answer: use Qwen3.5-27B as the fixed generation model to produce the final answer.
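
For concreteness, a minimal sketch of the crop step in this pipeline, assuming predicted boxes live in the [0, 1000] normalized page space and that page images are handled with Pillow; answer_with_vlm and the file names are placeholders, not part of any released evaluation code:

from PIL import Image

def denormalize_box(box, page_width, page_height, scale=1000):
    """Map an (x1, y1, x2, y2) box from [0, scale] space to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (int(x1 / scale * page_width), int(y1 / scale * page_height),
            int(x2 / scale * page_width), int(y2 / scale * page_height))

def crop_predicted_regions(page_image_path, predicted_boxes):
    """Crop every predicted evidence region from one rendered page image."""
    page = Image.open(page_image_path)
    width, height = page.size
    return [page.crop(denormalize_box(box, width, height)) for box in predicted_boxes]

def answer_with_vlm(question, crops):
    """Placeholder for the fixed answer-generation model (hypothetical)."""
    raise NotImplementedError("plug in the generation model used in the Answer step")

# Illustrative predicted box in the [0, 1000] space, not a real model output.
crops = crop_predicted_regions("page_11.png", [(120, 300, 880, 620)])
# answer = answer_with_vlm("Which segment labels mark the ends of the axes?", crops)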

QA under Different Evidence Granularities

We evaluate model performance under three levels of input granularity: the entire document, the evidence page(s), and the cropped evidence region(s). By comparing answer accuracy across these settings, we analyze how evidence granularity influences the question answering capability of VLMs.

Input 1
Entire document input example

Entire document

Whole document images

All pages of the source paper are provided to the model for direct answer generation.

Input 2
Evidence page input example

Evidence page(s)

Answer from evidence pages

Only the page or pages that support the answer evidence are given as input.

Input 3
Cropped evidence region input example

Cropped evidence region(s)

Answer from evidence regions

The model receives only the cropped evidence region or regions annotated for the QA pair.
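
As a rough sketch of how the three granularities could be assembled from one annotated sample: the per-paper directory of rendered pages and the page_{n}.png naming are assumptions, and bbox is taken to hold pixel coordinates in the rendered page (rel_bbox would be used instead when working with relative coordinates):

from pathlib import Path
from PIL import Image

def page_path(doc_dir, page_idx):
    # Assumed naming scheme for rendered page images.
    return Path(doc_dir) / f"page_{page_idx}.png"

def build_inputs(sample, doc_dir):
    """Return (whole_document, evidence_pages, evidence_crops) as lists of images."""
    all_pages = sorted(Path(doc_dir).glob("page_*.png"),
                       key=lambda p: int(p.stem.split("_")[1]))
    whole_document = [Image.open(p) for p in all_pages]

    evidence_pages = [Image.open(page_path(doc_dir, idx))
                      for idx in sample["evidence_page"]]

    evidence_crops = []
    for idx, boxes in zip(sample["evidence_page"], sample["bbox"]):
        page = Image.open(page_path(doc_dir, idx))
        evidence_crops.extend(page.crop(tuple(box)) for box in boxes)

    return whole_document, evidence_pages, evidence_crops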

Main Results

Top-line benchmark performance

The tables below summarize the main benchmark numbers. Task 1 evaluates both grounding quality and crop-then-answer performance, while Task 2 measures answer accuracy under different evidence granularities.

  • Task 1 · Mean IoU: Qwen3.5-397B-A17B, 38.89% (best average overlap with gold evidence regions)
  • Task 1 · IoU@0.3: Qwen3-VL-8B-SFT, 57.92% (highest proportion of predictions with IoU greater than 0.3)
  • Task 1 · Final accuracy: Qwen3.5-27B, 51.26% (best crop-then-answer accuracy in the grounding pipeline)
  • Task 2 · Whole document: Kimi-K2.5, 68.02% (best answer accuracy with full-paper input)
  • Task 2 · Evidence page(s): GPT-5.2, 81.39% (best answer accuracy when only evidence pages are given)
  • Task 2 · Evidence crop(s): GPT-5.2, 85.15% (best answer accuracy on cropped evidence regions)

Task 1

Grounding-Crop-then-Answer

Metrics: valid grounding output, IoU, and final answer accuracy.
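
For reference, a minimal sketch of box IoU and the thresholded IoU@0.3/0.5/0.7 columns below, assuming one predicted and one gold box per comparison (the exact matching rule for multi-box samples is not restated here):

def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_at(ious, threshold):
    """Fraction of samples whose IoU is greater than a threshold (e.g. 0.3, 0.5, 0.7)."""
    return sum(i > threshold for i in ious) / len(ious)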

Model | Valid output | Valid ratio | Mean IoU | IoU@0.3 | IoU@0.5 | IoU@0.7 | Acc
Qwen3-VL-8B | 1413 | 87.06% | 21.43% | 24.77% | 12.14% | 7.27% | 32.04%
Qwen3-VL-32B | 1593 | 98.15% | 31.53% | 44.55% | 18.55% | 11.95% | 46.46%
Qwen3-VL-235B-A22B | 1439 | 88.66% | 25.22% | 33.52% | 10.72% | 4.87% | 38.63%
Qwen3.5-27B | 1620 | 99.82% | 36.04% | 47.50% | 29.39% | 19.90% | 51.26%
Qwen3.5-122B-A10B | 1501 | 92.48% | 30.76% | 38.08% | 21.87% | 14.11% | 46.95%
Qwen3.5-397B-A17B | 1615 | 99.51% | 38.89% | 52.19% | 32.96% | 20.64% | 50.15%
InternVL3-38B | 1445 | 89.03% | 5.23% | 3.39% | 0.62% | 0.12% | 5.05%
Ernie-5 | 1378 | 84.90% | 14.57% | 19.10% | 9.55% | 4.00% | 22.12%
Kimi-K2.5 | 1563 | 96.30% | 7.17% | 7.16% | 5.49% | 2.23% | 7.46%
Claude Sonnet 4.6 | 1523 | 93.84% | 15.93% | 17.25% | 4.81% | 0.68% | 19.29%
GPT-5.2 | 1621 | 99.88% | 31.63% | 49.11% | 18.55% | 4.81% | 41.22%
Qwen3-VL-8B-SFT | 1619 | 99.75% | 38.66% | 57.92% | 28.77% | 17.25% | 49.97%

Task 2

Evidence-Granularity-QA

Metric: overall answer accuracy on benchmark QA pairs.

Model | Whole document | Evidence page(s) | Evidence crop(s)
Qwen3-VL-8B | 9.30% | 60.51% | 65.87%
Qwen3-VL-32B | 16.14% | 73.14% | 75.54%
Qwen3-VL-235B-A22B | 40.30% | 70.49% | 75.42%
Qwen3.5-27B | 42.95% | 69.69% | 77.02%
Qwen3.5-122B-A10B | 37.58% | 68.95% | 73.01%
Qwen3.5-397B-A17B | 36.23% | 68.08% | 73.75%
InternVL3-38B | 10.23% | 48.18% | 64.14%
Ernie-5 | 38.51% | 77.39% | 81.89%
Kimi-K2.5 | 68.02% | 79.98% | 81.21%
Claude Sonnet 4.6 | 43.99% | 68.45% | 74.31%
GPT-5.2 | 65.43% | 81.39% | 85.15%
Qwen3-VL-8B-SFT | 32.87% | 69.43% | 76.89%

Analysis Figures

What the raw result tables reveal

Task 1 grounding quality and answer accuracy analysis

In Task 1, grounding quality and final answer accuracy are tightly coupled, since the predicted bounding boxes determine the visual content provided to the downstream QA model. When the localized regions deviate from the true evidence, the cropped inputs may contain incomplete or irrelevant information, which easily leads to incorrect answers. Meanwhile, achieving high-IoU localization remains substantially harder than coarse localization, suggesting that current models still struggle to precisely capture semantically complete evidence regions in complex scientific documents.

Task 2 accuracy comparison across evidence granularities
Task 2 average accuracy and question-type analysis

In Task 2, overall QA accuracy improves as evidence becomes more focused, from document-level to page-level and finally region-level inputs. This trend indicates that reducing irrelevant context helps models concentrate on the most informative regions for reasoning. Further analysis shows that SPSR, SPMR, and MPMR form distinct difficulty bands rather than behaving like a single homogeneous QA set, with performance gradually decreasing as reasoning complexity increases.

Paper

Contributions

The paper positions SciEGQA as filling a missing middle ground in DocVQA: it preserves full-page scientific document reasoning while adding semantic-region evidence boxes that are complete enough to remain meaningful and precise enough to evaluate grounding. It also formalizes benchmark and training-set construction as two coordinated parts of one dataset ecosystem.

Task formulation

Question answering on scientific documents with explicit evidence pages and semantic-region bounding boxes.

Dataset scale

1,623 benchmark QA pairs plus 30,780 automatically generated training QA pairs from 3,671 papers.

Reasoning spectrum

SPSR, SPMR, and MPMR tasks cover localized, multi-region, and cross-page evidence integration.

Evaluation signals

Answer correctness and grounding quality can be evaluated together, rather than treating them as separate tasks.

Resources

Download, browse, and build on SciEGQA

Citation

BibTeX

If you use SciEGQA in your research, please cite the arXiv paper below.

@misc{yu2026sciegqadatasetscientificevidencegrounded,
      title={SciEGQA: A Dataset for Scientific Evidence-Grounded Question Answering and Reasoning}, 
      author={Wenhan Yu and Zhaoxi Zhang and Wang Chen and Guanqiang Qi and Weikang Li and Lei Sha and Deguo Xia and Jizhou Huang},
      year={2026},
      eprint={2511.15090},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2511.15090}, 
}