The Development of the Malaysian Hansard Corpus: A Corpus of Parliamentary Debates 1959-2020

  • Imran Ho Abdullah
  • Anis Nadiah Che Abdul Rahman
  • Azhar Jaludin
Keywords: Corpus development, corpus linguistics, parliamentary corpus, Malaysian Hansard, Malaysian Hansard Corpus


Parliamentary corpora are valuable language resources that cover a wide range of subject matter. They allow researchers to analyse data from different perspectives, including linguistics, political science, computational linguistics and history. The availability of a parliamentary corpus in a specific language facilitates the analysis of that language. In Malaysia, no parliamentary corpus had previously been developed, and the Malaysian Hansard portal cannot be used comprehensively to analyse linguistic patterns or semantic shift, or to carry out discourse analysis. The Malaysian Hansard Corpus (MHC) was created to provide a comprehensive and digitally accessible collection of parliamentary documents written in Malay and English, compiled from the Malaysian Parliamentary Reports (Malaysian Hansard). The MHC contains documents from the House of Representatives (Dewan Rakyat) of the Malaysian Parliament. The documents range in date from Parliament 1 (1959) to the present Parliament 14 (2019). This initial version of the digital collection comprises 167,513,039 tokens (word count) gathered from 3,684 scanned files in PDF format. The corpus is readily accessible for public use in txt and PDF formats and has its own distinctive typology. The availability of this corpus will promote more statistical analysis and hypothesis testing, and will enable researchers to test occurrences and frequencies or to validate linguistic rules within a specific language domain.
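
As an illustration of the kind of frequency analysis that the txt files make possible, the following is a minimal sketch in Python. It is not the procedure used to compile the MHC or to obtain its token count: the tokenisation rule, the function names `tokenise` and `corpus_frequencies`, the folder name `mhc_txt` and the sample sentence are all assumptions made for this example only.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a token is a run of letters or digits, optionally joined by
# internal hyphens or apostrophes (so "di-Pertua" counts as one token).
# The MHC's own tokenisation rule may differ.
TOKEN_RE = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")


def tokenise(text):
    """Return the lower-cased tokens found in a string."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def corpus_frequencies(folder):
    """Count token frequencies over every .txt file in a folder.

    `folder` is a hypothetical local directory holding the corpus txt files.
    """
    counts = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        counts.update(tokenise(text))
    return counts


# Invented sample sentence mixing Malay and English, for demonstration only.
sample = ("Tuan Yang di-Pertua, saya mohon mencadangkan. "
          "Yang Berhormat, the Bill is read.")
freq = Counter(tokenise(sample))
print(freq.most_common(3))

# With a local copy of the txt files, usage would look like:
# freq = corpus_frequencies("mhc_txt")
# total_tokens = sum(freq.values())
```

Token totals obtained in this way depend on the tokenisation rule chosen, so they would not necessarily match the figure reported for the corpus.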