Data cleaning: Using MySQL to identify XML breaking characters
Sometimes publishers have problems with data resources that contain control characters that will break the xml response if they are included. Identifying these characters and removing them can be a daunting task, especially if the dataset contains thousands of records. Publishers that share datasets through the DiGIR and TAPIR protocols are especially vulnerable to text fields that contain polluted data. Information about locality (http://rs.tdwg.org/dwc/terms/index.htm#locality) is often quite rich and can be copied from diverse sources, thereby entering the database table possibly without having been through a verification or a cleaning process. The locality string can be copy/pasted from a file into the locality column, or the data itself can be mass loaded infile, or it can be bulk inserted – each of these methods contains a risk that unintended characters enter the table. Even if you have time and are meticulous, you could miss certain control characters because they are invisible...