Vis-Sieve Demonstration
Important context
The public web demo (https://iszhiyangwang.github.io/MMLLA/) is populated with the VisImages dataset (≈ 35 k chart and diagram excerpts) to ensure open evaluation and repeatability.
Institutional corpora (e.g., Princeton 2022–2023, ≥ 11 k papers, ≥ 15 k figures) can be swapped into the same interface with minimal code changes.
Vis-Sieve is a pipeline and interface for surveying, annotating, and interactively exploring large-scale collections of scientific figures. The system supports visualization-service providers by enabling:
- Facility & tool planning — evidence-based decisions on software/hardware investment.
- Technique discovery — rapid browsing of exemplar charts beyond the provider’s core domain.
- Trend analysis — longitudinal views of chart-type adoption.
- Expert identification — spotting advanced visualization practitioners for collaboration or hiring.
System Overview
| Phase | Key Steps | Technologies |
|---|---|---|
| 1 · Data Acquisition | Harvest PDFs + metadata via OpenAlex → store in DuckDB (sketched below) | Python · Playwright |
| 2 · Figure Extraction | pdffigures2 ⇢ image + caption pairs; multipart detection via VisImages-Detection (Faster R-CNN) | Java/Golang · PyTorch |
| 3 · Automated Annotation | Zero-shot chart-type labeling with GPT-4o-mini (image + caption prompt; sketched below) | OpenAI API |
| 4 · Interactive Visualization | 2D Faceted Browser (filter/search/sort); 3D Exploration (WebGL) | D3.js · Three.js · Observable |
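As a rough illustration of Phase 1, the sketch below pages through the OpenAlex works endpoint for one institution and stores basic metadata in a local DuckDB table. The institution ID, date filter, table schema, and field selection are placeholders rather than the pipeline's actual configuration, and the Playwright-based PDF download step is omitted.

```python
# Minimal sketch of Phase 1: harvest OpenAlex metadata into DuckDB.
# Institution ID, date filter, and table schema are illustrative only.
import duckdb
import requests

OPENALEX_URL = "https://api.openalex.org/works"
INSTITUTION_ID = "I0000000000"  # placeholder: replace with the institution's OpenAlex ID

def fetch_works(institution_id: str, per_page: int = 200):
    """Page through OpenAlex works for one institution using cursor paging."""
    cursor = "*"
    while cursor:
        resp = requests.get(OPENALEX_URL, params={
            "filter": f"institutions.id:{institution_id},from_publication_date:2022-01-01",
            "per-page": per_page,
            "cursor": cursor,
        })
        resp.raise_for_status()
        data = resp.json()
        yield from data["results"]
        cursor = data["meta"].get("next_cursor")  # None on the last page

con = duckdb.connect("vis_sieve.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS works (
        id TEXT PRIMARY KEY,
        title TEXT,
        publication_year INTEGER,
        pdf_url TEXT
    )
""")
for work in fetch_works(INSTITUTION_ID):
    oa = work.get("best_oa_location") or {}
    con.execute(
        "INSERT OR REPLACE INTO works VALUES (?, ?, ?, ?)",
        [work["id"], work.get("title"), work.get("publication_year"), oa.get("pdf_url")],
    )
```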
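For Phase 3, the following is a minimal sketch of zero-shot chart-type labeling with GPT-4o-mini through the OpenAI Python client. The prompt wording, label vocabulary, and file paths are assumptions for illustration, not the exact prompt used by the pipeline.

```python
# Minimal sketch of Phase 3: zero-shot chart-type labeling with GPT-4o-mini.
# Prompt text and label list are illustrative; the pipeline's actual prompt may differ.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CHART_TYPES = ["bar chart", "line chart", "scatterplot", "heatmap", "other"]  # assumed subset

def label_figure(image_path: str, caption: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Caption: {caption}\n"
                         f"Which of these chart types best describes the figure: "
                         f"{', '.join(CHART_TYPES)}? Answer with one label only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        max_tokens=10,
    )
    return response.choices[0].message.content.strip()

print(label_figure("figures/example.png", "Accuracy over training epochs."))
```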
2D Faceted Browser
The 2D dashboard presents a tabular view of figures with sortable columns (chart type, year, venue, etc.) and facet filters. When the demo runs on VisImages, all VisImages taxonomy classes are available for instant filtering.
3D Free-Exploration Interface
Embedding & Layout
For large collections, the system first embeds each figure using CLIP features, then applies a treemap-inspired packing that uses chart-type frequencies to partition space and minimize overlap.
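A minimal sketch of the embedding step is shown below, using the Hugging Face CLIP model to produce one feature vector per figure. The checkpoint choice and single-batch handling are assumptions, and the treemap-inspired packing itself is omitted.

```python
# Minimal sketch of the embedding step: one CLIP vector per figure image.
# Checkpoint choice and one-shot batching are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_figures(image_paths: list[str]) -> torch.Tensor:
    """Return an (N, 512) matrix of L2-normalized CLIP image embeddings."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

embeddings = embed_figures(["figures/fig_001.png", "figures/fig_002.png"])
print(embeddings.shape)  # torch.Size([2, 512])
```

Chart-type frequencies can then determine how much layout area each class receives before the embedded figures are positioned within their cells.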
Interaction
- Pan/zoom/rotate in WebGL.
- Hover → live thumbnail + metadata.
- Click → deep-link back to the 2D record.
Accuracy & Efficiency Highlights
- 35,016 VisImages fragments auto-labeled in ≈ 90 min for $32.69 in API cost.
- Manual spot-checks (n = 300) show 91.2 % mean accuracy; confusion-matrix comparison with the native VisImages labels shows an overall 89 % agreement after multipart handling.
Reproducibility & Extensibility
- Dataset agnostic. Swap in any institution’s PDF corpus; only a config file changes.
- Modular annotation. Replace GPT-4o with any vision-language model or human-in-the-loop step.
- Scalable front-end. Tiled-montage loading keeps runtime memory low even for collections of > 30 k figures.
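To illustrate the dataset-agnostic design, a corpus swap might be expressed through a configuration like the hypothetical one below; the key names and structure are illustrative and do not reflect the project's actual config schema.

```python
# Hypothetical corpus configuration; key names are illustrative only.
CORPUS_CONFIG = {
    "name": "princeton-2022-2023",              # label shown in the interface
    "openalex_institution_id": "I0000000000",   # placeholder institution ID
    "date_range": ["2022-01-01", "2023-12-31"],
    "pdf_dir": "data/pdfs/",                    # where harvested PDFs are stored
    "duckdb_path": "data/vis_sieve.duckdb",
    "annotation_model": "gpt-4o-mini",          # swappable per the modular-annotation design
}
```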