The performance of question answering system is evaluated through successive evaluations campaigns. A set of questions are given to the participating systems which are to find the correct answer in a collection of documents. The creation process of the questions may change from one evaluation to the next. This may entail an uncontroled question difficulty shift. For the QAst 2009 evaluation campaign, a new procedure was adopted to build the questions. Comparing results of QAst 2008 and QAst 2009 evaluations, a strong performance loss could be measured in 2009 for French and English, while the Spanish systems globally made progress. The measured loss might be related to this new way of elaborating questions. The general purpose of this paper is to propose a measure to calibrate the difficulty of a question set. In particular, a reasonable measure should output higher values for 2009 than for 2008. The proposed measure relies on a distance measure between the critical elements of a question and those of the associated correct answer. An increase of the proposed distance measure for French and English 2009 evaluations as compared to 2008 could be established. This increase correlates with the previously observed degraded performances. We conclude on the potential of this evaluation criterion: the importance of such a measure for the elaboration of new question corpora for questions answering systems and a tool to control the level of difficulty for successive evaluation campaigns.