Parsing Legislative Data from Selangor State Assembly

Technical article

The State Assembly of Selangor offers open access to its legislative data, but only as PDF files. Finding an application to read them is trivial, but there isn't much else we can do with them. Every time one needs to refer to a part of a document, one has to go through the whole thing again. Needless to say, it also means there is no possibility of interoperability.

We started by crawling the website and downloading the published legislative data. Then, after going through the documents, we drafted a schema for how the data should be structured. One of the objectives of this exercise is also to publish the data in the AkomaNtoso schema. The simplified schema for Hansard and Inquiry data would look as follows:

There are many Python libraries capable of extracting data from PDF. After some research and experimentation, we chose unstructured. It is one of the few libraries able to extract both text and tabular data in one pass. In the code, we also opted to extract tabular data as images.

The reason for extracting tabular data as images is to let us index each snippet so that it is searchable. The extracted text often cannot be reassembled into the original table. Thus, keeping a reference that allows switching between the two forms is helpful.

We will discuss switching between the two forms via a web interface in another report.
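As a rough illustration of this step, the sketch below shows one way to call unstructured so that detected tables are also kept as images, and how text and image can be paired for indexing. It is not the project's actual code: the function names `extract_elements` and `to_snippets`, the snippet dictionary layout, and the choice of the `hi_res` strategy are assumptions made for this example, and the parameter names may differ between versions of the library.

```python
from types import SimpleNamespace


def extract_elements(pdf_path):
    """Partition a PDF into elements, keeping table crops as base64 images."""
    # Imported lazily so that to_snippets() below works without the library installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(
        filename=pdf_path,
        strategy="hi_res",                    # layout-aware strategy (assumed choice)
        infer_table_structure=True,           # also keep a structured rendering of tables
        extract_image_block_types=["Table"],  # crop detected tables as images
        extract_image_block_to_payload=True,  # keep the crops in element metadata
    )


def to_snippets(elements):
    """Pair each element's text with its image, if any, so both forms stay linked."""
    snippets = []
    for position, element in enumerate(elements):
        metadata = getattr(element, "metadata", None)
        snippets.append(
            {
                "position": position,
                "text": element.text,
                "image_base64": getattr(metadata, "image_base64", None),
            }
        )
    return snippets


# Stand-in elements for illustration only; real ones come from extract_elements().
demo_elements = [
    SimpleNamespace(text="Plain paragraph", metadata=SimpleNamespace(image_base64=None)),
    SimpleNamespace(text="Cell A Cell B", metadata=SimpleNamespace(image_base64="aGVsbG8=")),
]
demo_snippets = to_snippets(demo_elements)
```

Each snippet keeps the searchable text next to the image it came from, so an index entry can always point back to the visual form of a table.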

The extraction step produces a list of objects containing text. Where applicable, an object may also carry an image representation of the extracted text. The library works well most of the time, but in many cases the output is imperfect.

For instance, the library sometimes treats a passage as a table instead of parsing it as text, so the order of the text comes out wrong. Instead of

Y.B. PUAN DR. DAROYAH BINTI ALWI: Terima kasih Tuan Speaker, soalan Sementa, No. 4.

We have instead

Y.B. PUAN DR. DAROYAH BINTI ALWI soalan Sementa, No. 4. :

Retaining the image representation becomes crucial in this context. While the extraction is imperfect, it still offers enough information for indexing.

The lists of extracted objects returned by the library were then cached and saved to the archive. These cached lists are the pickle files one sees when browsing the data repository.

Parsing logic often requires many revisions, whereas extraction returns consistent results from run to run. Each extraction run also takes quite some time, depending on the length of the document. This is why we save the pickled files too.
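The caching idea can be sketched as follows. This is a minimal example rather than the project's code: the helper name `load_or_extract`, the `.pickle` file naming, and the `extract` callback are assumptions. The slow extraction runs only when no cached file exists, so the parsing logic can be revised and rerun cheaply.

```python
import pickle
from pathlib import Path


def load_or_extract(pdf_path, cache_dir, extract):
    """Return cached elements for a PDF, running `extract` only on a cache miss."""
    cache_path = Path(cache_dir) / (Path(pdf_path).stem + ".pickle")

    # Cache hit: skip the slow extraction entirely.
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)

    # Cache miss: extract once, then store the result for later parsing runs.
    elements = extract(pdf_path)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as handle:
        pickle.dump(elements, handle)
    return elements
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.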

Then we went through the objects in sequence and constructed the desired records. We then generated XML markup based on the AkomaNtoso schema for each record. Finally, we serialized the data into JSON, a popular and readable data format.
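The three steps above can be sketched as follows. This is a simplified illustration under several assumptions, not the project's parser: the `Speech` record, the speaker regular expression, the helper names, and the way the `by` reference is derived are all hypothetical, and the XML is only a fragment that has not been validated against the full schema.

```python
import json
import re
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass

AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"

# Assumed pattern: a speaker turn starts with an upper-case name followed by a colon.
SPEAKER = re.compile(r"^(?P<speaker>[A-Z][A-Z .'\-]+):\s*(?P<text>.*)$")


@dataclass
class Speech:
    speaker: str
    text: str


def build_speeches(texts):
    """Walk the extracted texts in order, starting a record at each speaker line."""
    speeches = []
    for text in texts:
        match = SPEAKER.match(text.strip())
        if match:
            speeches.append(Speech(match["speaker"].strip(), match["text"]))
        elif speeches:
            # A line without a speaker continues the previous speech.
            speeches[-1].text = f"{speeches[-1].text} {text.strip()}".strip()
    return speeches


def speech_to_akn(speech):
    """Render one record as a simplified <speech> fragment."""
    ET.register_namespace("", AKN_NS)
    speaker_id = re.sub(r"[^a-z0-9]+", "-", speech.speaker.lower()).strip("-")
    node = ET.Element(f"{{{AKN_NS}}}speech", {"by": f"#{speaker_id}"})
    ET.SubElement(node, f"{{{AKN_NS}}}from").text = speech.speaker
    ET.SubElement(node, f"{{{AKN_NS}}}p").text = speech.text
    return ET.tostring(node, encoding="unicode")


def speech_to_json(speech):
    """Serialize the record together with its XML markup."""
    return json.dumps({**asdict(speech), "akn": speech_to_akn(speech)}, ensure_ascii=False)


lines = [
    "Y.B. PUAN DR. DAROYAH BINTI ALWI: Terima kasih Tuan Speaker,",
    "soalan Sementa, No. 4.",
]
speeches = build_speeches(lines)
records = [json.loads(speech_to_json(speech)) for speech in speeches]
```

Keeping the record as a plain dataclass means the same object can feed both the XML and the JSON output, so the two serializations cannot drift apart.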

The process would clearly be simpler if the legislative data were published in a structured form to begin with. As mentioned, AkomaNtoso is a well-defined schema that has been available for years. It is also possible to publish the data in other formats and schemas, as long as they are open and public.