Baseline, F1 score, black box, ablation, diagnostic, extrinsic/intrinsic, performance, annotation, metrics, human-based, test suite... Terms like these constantly show up in NLP papers, books, and code. What do they have in common? They all relate to the evaluation of systems. Adequate and fair evaluation is an essential step when building, analyzing, and comparing models or algorithms. In this course, we will cover the main aspects of current machine learning evaluation methods and how the NLP community has been adapting them to the specific needs of different NLP tasks.
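
As a quick taste of one of the terms above, here is a minimal sketch of computing the F1 score on toy labels with scikit-learn; the labels are illustrative and not drawn from any real dataset:

```python
# Illustrative sketch: F1 score for a toy binary classification task.
# The gold labels and predictions below are made up for demonstration.
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1]  # gold (reference) labels
y_pred = [1, 0, 0, 1, 0, 1]  # system predictions

# F1 is the harmonic mean of precision and recall.
# Here precision = 3/3 = 1.0 and recall = 3/4 = 0.75, so F1 ≈ 0.857.
print(f1_score(y_true, y_pred))
```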