Large-Scale Complex Question Answering Dataset

What is LC-QuAD?

The aim of LC-QuAD is to make available a large dataset for Question Answering (QA) over structured data in RDF format. It consists of 5000 pairs, each a natural language question and its corresponding SPARQL query. To create the dataset, we used a set of typical query templates and then converted seed entities in the RDF graph to a normalised natural question structure (NNQS). Native English speakers then transformed these into natural language questions with different lexical and syntactic variations. Please see our paper for more details.
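
As a rough illustration of the template step only, the sketch below fills a query template with concrete resources. The placeholder tokens `<P>` and `<E>`, the helper `fill_template`, and the example.org URIs are all hypothetical, chosen for this example; they are not the placeholder syntax or the code used to build LC-QuAD.

```python
def fill_template(template, bindings):
    """Replace each placeholder token in the template with a bracketed URI.

    `bindings` maps a placeholder token (hypothetical syntax, e.g. "<P>")
    to the URI that should take its place.
    """
    query = template
    for placeholder, uri in bindings.items():
        query = query.replace(placeholder, f"<{uri}>")
    return query


# Hypothetical template: one triple pattern with a predicate and an entity slot.
TEMPLATE = "SELECT DISTINCT ?uri WHERE { ?uri <P> <E> }"

# Made-up resources for illustration; not taken from the dataset.
query = fill_template(
    TEMPLATE,
    {
        "<P>": "http://example.org/ontology/author",
        "<E>": "http://example.org/resource/Some_Book",
    },
)
print(query)
```

Plain string replacement is used here because SPARQL's curly braces would clash with `str.format`; any templating approach that leaves `{`, `}` and `?uri` untouched would serve equally well.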

Documentation & Usage Guides

The documentation of the project will soon be made available on our repository's wiki.
We're working round the clock to get it up!

Every data item in the dataset consists of the following fields:

      {
        template_id: "Every unique SPARQL template has a different ID.",
        sparql_template: "A query where resources in the triple pattern are replaced with placeholders.",
        sparql_query: "Valid SPARQL query generated by using the triples in subgraphs to fill the placeholder resources in SPARQL templates.",
        verbalized_question: "The automatically verbalized equivalent of the SPARQL query.",
        corrected_question: "Human-corrected version of the verbalized question.",
        _id: "Unique ID generated for every data node."
      }
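
A minimal sketch of reading data items with the fields listed above, assuming the dataset is available as a JSON array of objects. The helper names `load_items` and `question_query_pairs` and the sample record are made up for illustration; the sample is not a real entry from LC-QuAD.

```python
import json

# Field names as documented in the list above.
EXPECTED_FIELDS = {
    "_id",
    "template_id",
    "sparql_template",
    "sparql_query",
    "verbalized_question",
    "corrected_question",
}


def load_items(json_text):
    """Parse a JSON array of data items and check each has the documented fields."""
    items = json.loads(json_text)
    for item in items:
        missing = EXPECTED_FIELDS - set(item)
        if missing:
            raise ValueError(
                f"item {item.get('_id')!r} is missing fields: {sorted(missing)}"
            )
    return items


def question_query_pairs(items):
    """Return (human-corrected question, SPARQL query) pairs."""
    return [(item["corrected_question"], item["sparql_query"]) for item in items]


# Made-up record for illustration only; values are placeholders.
SAMPLE = json.dumps([
    {
        "_id": "example-1",
        "template_id": 1,
        "sparql_template": "SELECT DISTINCT ?uri WHERE { ?uri <P> <E> }",
        "sparql_query": "SELECT DISTINCT ?uri WHERE { ?uri <http://example.org/p> <http://example.org/e> }",
        "verbalized_question": "What is the <p> of <e>?",
        "corrected_question": "What is the p of e?",
    }
])

items = load_items(SAMPLE)
pairs = question_query_pairs(items)
print(pairs[0][0])
```

To read from disk instead, pass the file's contents to `load_items`, for example `load_items(open(path, encoding="utf-8").read())`, where `path` points at your local copy of the data.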