TAG Leaderboard
A benchmark for natural language queries over data
Rank | Method | Execution Accuracy (%) |
---|---|---|
10 |  | 65 |
What does the TAG leaderboard evaluate?
In this leaderboard, you'll find execution accuracy comparisons of table question answering approaches on TAG-Bench. TAG-Bench contains complex queries requiring world knowledge or semantic reasoning beyond the information explicitly available in the database. For example, a query over a movie table might ask which of the listed films is considered a cult classic, knowledge the database itself does not store.
How is accuracy measured?
Execution accuracy is measured as the percentage of outputs that exactly match our annotated ground-truth answers, which are hand-labeled by experts.
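As a concrete illustration, here is a minimal sketch of that computation in Python; it assumes answers are compared as normalized strings, which may differ in detail from the official scorer.

```python
# Minimal sketch of exact-match execution accuracy. Assumes answers are
# compared as whitespace-trimmed, case-insensitive strings; the official
# scorer's normalization may differ.
def execution_accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Percentage of predictions that exactly match the hand-labeled answers."""
    assert len(predictions) == len(ground_truth)
    matches = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truth)
    )
    return 100.0 * matches / len(ground_truth)
```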
Citation
@misc{biswal2024text2sqlenoughunifyingai,
  title={Text2SQL is Not Enough: Unifying AI and Databases with TAG},
  author={Asim Biswal and Liana Patel and Siddarth Jha and Amog Kamsetty and Shu Liu and Joseph E. Gonzalez and Carlos Guestrin and Matei Zaharia},
  year={2024},
  eprint={2408.14717},
  archivePrefix={arXiv},
  primaryClass={cs.DB},
  url={https://arxiv.org/abs/2408.14717},
}
Ensure the following files are included in your submission:
- output.json: File containing the evaluation outputs generated by your model. Please refer to [] for format instructions.
- requirements.txt: A list of dependencies needed to run your model or script.
- README.md: A detailed description of your submission, including:
- Purpose and overview of the submission.
- Instructions to reproduce the results.
- Any additional notes for evaluators.
- Model/Keys: Upload your models or API keys to Hugging Face if they are not publicly accessible.
Note: Submissions missing any of these materials will not be processed.
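Before emailing, you may want to sanity-check your bundle against the checklist above. The sketch below is illustrative; it only verifies that the listed files exist and does not replicate the evaluators' checks.

```python
from pathlib import Path

# Hedged pre-submission check: verifies the files from the checklist above
# exist in your submission directory. Adjust the directory path as needed.
REQUIRED = ["output.json", "requirements.txt", "README.md"]

def check_submission(directory: str) -> bool:
    missing = [f for f in REQUIRED if not (Path(directory) / f).exists()]
    if missing:
        print(f"Missing required files: {', '.join(missing)}")
        return False
    print("All required files present.")
    return True
```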
- Submissions are accepted once a month to ensure sufficient evaluation bandwidth.
- Plan your submission timeline accordingly to avoid delays.
Follow these steps to upload your materials:
- Compress all submission files into a single .zip file, or provide a link to a public repository.
- Email the .zip file or repository link to tagbenchmark@gmail.com.
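If you'd rather script the packaging step, here is a minimal sketch; the submission/ directory name and layout are illustrative assumptions, not a required structure.

```python
import zipfile
from pathlib import Path

# Minimal packaging sketch: bundles everything under a local "submission/"
# directory (an illustrative name) into submission.zip for emailing.
def make_zip(src_dir: str = "submission", out_path: str = "submission.zip") -> None:
    src = Path(src_dir)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(src.rglob("*")):
            if file.is_file():
                # Store paths relative to the submission root.
                zf.write(file, file.relative_to(src))

if __name__ == "__main__":
    make_zip()
```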
After uploading your materials:
- Provide accurate contact information for follow-ups.
- Double-check your materials for completeness to avoid processing delays.
Important: Your submission will be added to the evaluation queue. Depending on the queue size, evaluations may take up to a few weeks.