This page includes several common data quality errors that you can find and fix in your dataset using some creative searching and/or existing data quality tools. Data quality issues are grouped into data categories.
For more help with data quality, see the following resources:
Problem: The same catalog number is used multiple times within your dataset. (This may or may not be intentional, depending on your collection’s policies. It is generally best not to duplicate catalog numbers when possible.)
Solution: Use the duplicate catalog number tool to view, edit, and/or merge duplicates. Note that only users with administrator permissions can use this tool.
Problem: The date the specimen was collected (often designated using the eventDate field) is in the future.
Solution: There are two ways you can find records with this problem:
Method 1:
Click the box and arrow icon to the right of the Symbiota ID number to open the record in a new window. That way, you will not need to re-conduct your search after editing every record.
Method 2:
Problem: The date the specimen was collected (often designated using the eventDate field) is outside the expected historical date range. The expected date range depends on the institution, but it is unlikely that most collections have specimens with dates prior to 1600.
Solution: See the methods described in the Date Hasn’t Happened Yet section, but make the following modifications:
Method 1:
In step 2, select “ascending” in the second dropdown list in the Sort By field. You will also likely want to filter out records without dates by selecting Date from Custom Field 1 and selecting IS NOT NULL from the second dropdown list. This will remove any blank dates from your search results, which would normally show up at the beginning of your ascending list.
Method 2:
In step 3, enter 0001-01-01 in the first Collection Date field (or the Collection Start Date field) and the earliest date you think would be possible in your collection (e.g., 1700-01-01) in the second field (or the Collection End Date field).
Problem: The date the specimen was identified (dateIdentified field) is earlier than the date the specimen was collected (eventDate).
Solution: This problem cannot be identified using Symbiota portal tools. To locate records with this issue, download your data from the public search page, as a backup file, or using the exporter tool. You can then use a spreadsheet program to compare the dateIdentified to the eventDate field (see Excel instructions here).
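If you would rather script the comparison than use a spreadsheet, the following is a minimal Python sketch. It assumes the download is a CSV with Darwin Core column headers (eventDate, dateIdentified, catalogNumber) and that dates are stored as YYYY-MM-DD. The sample rows are made up for illustration, and partial dates are skipped rather than guessed at.

```python
import csv
import io
from datetime import date


def parse_iso(value):
    """Return a date for a value starting with YYYY-MM-DD, else None."""
    try:
        return date.fromisoformat((value or "").strip()[:10])
    except ValueError:
        return None


def identified_before_collected(rows):
    """Return rows whose dateIdentified is earlier than their eventDate."""
    flagged = []
    for row in rows:
        collected = parse_iso(row.get("eventDate"))
        identified = parse_iso(row.get("dateIdentified"))
        # Only compare when both dates are complete and parseable.
        if collected and identified and identified < collected:
            flagged.append(row)
    return flagged


# Illustrative data standing in for a downloaded occurrence file.
sample = io.StringIO(
    "catalogNumber,eventDate,dateIdentified\n"
    "A1,2001-06-15,2003-02-01\n"
    "A2,2010-08-20,2009-01-05\n"
    "A3,1998-04-02,\n"
)
flagged = identified_before_collected(csv.DictReader(sample))
print([row["catalogNumber"] for row in flagged])
```

To run this on a real download, replace the `io.StringIO` sample with an opened file, for example `open("occurrences.csv", newline="", encoding="utf-8")` (a hypothetical file name).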
Problem: The event year, month, and day values do not match the provided event date. The event date is often the date of collection for preserved specimens.
Solution: This problem cannot be identified using Symbiota portal tools. To locate records with this issue, download your data from the public search page, as a backup file, or using the exporter tool. You can then use a spreadsheet program to compare the year, month, and day fields to the eventDate field (see Excel instructions here).
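The check of year, month, and day against the event date can also be scripted. This sketch assumes Darwin Core column names (year, month, day, eventDate) and a full YYYY-MM-DD event date. Rows with blank or partial values are not compared, and the sample rows are invented for illustration.

```python
from datetime import date


def ymd_mismatch(row):
    """True when year/month/day are all present but disagree with eventDate."""
    try:
        event = date.fromisoformat((row.get("eventDate") or "").strip()[:10])
        parts = (int(row["year"]), int(row["month"]), int(row["day"]))
    except (KeyError, TypeError, ValueError):
        return False  # incomplete or unparseable values are not compared here
    return parts != (event.year, event.month, event.day)


# Illustrative rows; B2 has month and day swapped.
rows = [
    {"catalogNumber": "B1", "eventDate": "1995-07-04", "year": "1995", "month": "7", "day": "4"},
    {"catalogNumber": "B2", "eventDate": "1995-07-04", "year": "1995", "month": "4", "day": "7"},
    {"catalogNumber": "B3", "eventDate": "1995-07", "year": "1995", "month": "7", "day": ""},
]
mismatched = [row["catalogNumber"] for row in rows if ymd_mismatch(row)]
print(mismatched)
```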
Problem: The provided latitude and/or longitude values are 0.
Solution:
Problem: The provided coordinates do not fall within the geographic boundaries of the named country, state, and/or county.
Solution: This problem cannot currently be identified using Symbiota portal tools. We recommend either using the GBIF Reverse Geocoding API to verify that coordinates match the stated country, or simply publishing your data to GBIF and viewing the data quality flags of your dataset.
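As a rough illustration of the API route, the sketch below builds a request to GBIF's reverse geocoding endpoint and checks the response for an expected country code. The `isoCountryCode2Digit` response key is an assumption about the response shape and should be confirmed against the current GBIF API documentation. The helper names are hypothetical, and `fetch_layers` is defined but not called here because it needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_REVERSE_URL = "https://api.gbif.org/v1/geocode/reverse"


def reverse_geocode_url(lat, lng):
    """Build the request URL for one coordinate pair."""
    return GBIF_REVERSE_URL + "?" + urlencode({"lat": lat, "lng": lng})


def fetch_layers(lat, lng):
    """Fetch and parse the JSON response (requires network access)."""
    with urlopen(reverse_geocode_url(lat, lng), timeout=30) as response:
        return json.load(response)


def country_matches(layers, country_code):
    """True if any returned item carries the expected two-letter country code.

    Assumes each item in the response may include an
    'isoCountryCode2Digit' key; verify this against the API docs.
    """
    found = {item.get("isoCountryCode2Digit") for item in layers}
    return country_code.upper() in found


print(reverse_geocode_url(40.0, -105.27))
```

A record would be worth reviewing when `country_matches(fetch_layers(lat, lng), code)` returns False. For large datasets, the GBIF data quality flags are likely the more practical option than one request per record.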
Problem: Metadata fields regarding coordinates, such as coordinateUncertaintyInMeters, georeferenceProtocol, georeferenceSources, georeferencedBy, georeferenceRemarks, and geodeticDatum are provided, but no coordinates are present. This is sometimes intentional, particularly when georeferencedBy and georeferenceRemarks are used to indicate whether a record was purposefully not georeferenced. However, it is rare that the other metadata fields can be used without associated coordinates (i.e., decimalLatitude, decimalLongitude, or verbatimCoordinates).
Solution:
Click the box and arrow icon to the right of the Symbiota ID number to open the record in a new window. That way, you will not need to re-conduct your search after editing every record.
Problem: Elevation values are either too high (>17000 m) or too low (<-11000 m) to occur on Earth.
Solution:
Problem: The sign of the latitude (decimalLatitude) or longitude (decimalLongitude) does not match the sign/hemisphere of the given country. For example, all longitudes in the U.S. should be negative.
Solution: This problem cannot currently be identified using Symbiota portal tools. We recommend either using the GBIF Reverse Geocoding API to verify that coordinates match the stated country, or simply publishing your data to GBIF and viewing the data quality flags of your dataset.
Problem: Coordinates fall outside accepted ranges or formats, for example decimalLatitude values outside -90 to 90 or decimalLongitude values outside -180 to 180. verbatimCoordinates values must be valid coordinates in decimal degrees, degrees decimal minutes, or degrees minutes seconds.
Solution: Some types of invalid coordinates can be identified using the Record Search Form.
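For a downloaded file, the range portion of this check is simple to script. The sketch below covers only decimalLatitude and decimalLongitude as text values read from a file. It does not attempt to validate verbatimCoordinates formats, and the function name is hypothetical.

```python
def coordinate_problem(lat_text, lng_text):
    """Return a short description of the problem, or None if both values are in range."""
    try:
        lat, lng = float(lat_text), float(lng_text)
    except (TypeError, ValueError):
        return "not numeric"
    if not -90 <= lat <= 90:
        return "latitude out of range"
    if not -180 <= lng <= 180:
        return "longitude out of range"
    return None


# Illustrative values as they might appear in a CSV export.
for pair in [("45.5", "-122.6"), ("95", "10"), ("10", "190"), ("", "10")]:
    print(pair, coordinate_problem(*pair))
```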
Problem: Lower geography (e.g., county, state/province) values exist, but no higher geography values (e.g., country) are provided.
Solution: This issue can be quickly identified and fixed using the geography cleaning tools. Note that you must have administrator permissions to use these tools.
Problem: The minimum elevation (minimumElevationInMeters) has a greater value than the maximum elevation (maximumElevationInMeters).
Solution: This problem cannot currently be identified using Symbiota portal tools. To locate records with this issue, download your data from the public search page, as a backup file, or using the exporter tool. You can then use a spreadsheet program to compare the minimumElevationInMeters to the maximumElevationInMeters field (see Excel instructions here).
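The same comparison can be done in a few lines of Python instead of a spreadsheet. This sketch assumes the export uses the Darwin Core column names given above and skips rows where either elevation is blank or non-numeric. The sample rows are invented.

```python
def elevation_inverted(row):
    """True when minimumElevationInMeters is greater than maximumElevationInMeters."""
    try:
        low = float(row["minimumElevationInMeters"])
        high = float(row["maximumElevationInMeters"])
    except (KeyError, TypeError, ValueError):
        return False  # blank or non-numeric values are not compared
    return low > high


rows = [
    {"catalogNumber": "C1", "minimumElevationInMeters": "1200", "maximumElevationInMeters": "1500"},
    {"catalogNumber": "C2", "minimumElevationInMeters": "2100", "maximumElevationInMeters": "210"},
    {"catalogNumber": "C3", "minimumElevationInMeters": "800", "maximumElevationInMeters": ""},
]
inverted = [row["catalogNumber"] for row in rows if elevation_inverted(row)]
print(inverted)
```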
Problem: The provided value for country and countryCode do not match.
Solution: This problem cannot currently be identified using Symbiota portal tools. To locate records with this issue, download your data from the public search page, as a backup file, or using the exporter tool. You can then use a spreadsheet program to compare the unique combinations of country and countryCode to look for deviations (see Excel instructions here).
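Listing the unique combinations can also be scripted. The sketch below counts each (country, countryCode) pair and reports country names that appear with more than one code. It assumes Darwin Core column names; the function names and sample rows are illustrative only.

```python
from collections import Counter


def country_code_pairs(rows):
    """Count each unique (country, countryCode) combination."""
    return Counter(
        ((row.get("country") or "").strip(), (row.get("countryCode") or "").strip())
        for row in rows
    )


def countries_with_multiple_codes(pairs):
    """Map each country name that appears with more than one code to its codes."""
    codes = {}
    for country, code in pairs:
        codes.setdefault(country, set()).add(code)
    return {country: sorted(found) for country, found in codes.items() if len(found) > 1}


rows = [
    {"country": "Canada", "countryCode": "CA"},
    {"country": "Canada", "countryCode": "CA"},
    {"country": "Canada", "countryCode": "US"},
    {"country": "United States", "countryCode": "US"},
]
pairs = country_code_pairs(rows)
print(countries_with_multiple_codes(pairs))
```

A country with several codes is only a candidate for review. Deciding which value is wrong still requires looking at the records themselves.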
Problem: Geodetic datum is a key part of a properly georeferenced specimen record, but it is usually left blank. Although the datum is commonly assumed to be ‘WGS84’, it should be recorded explicitly rather than left to assumption.
Solution:
Problem: A record has lower geographic terms (e.g., state/province, county) that do not exist under the provided higher geographic term(s). For example, country = Canada and stateProvince = Sussex. There is no Sussex province in Canada.
Solution: This issue can be quickly identified and fixed using the geography cleaning tools. Note that you must have administrator permissions to use these tools.
Problem: A record has a latitude value, but not a longitude value, or vice versa.
Solution:
Problem: The geographic units (e.g., country, state/province, county) are misspelled, resulting in poor matching of geographic unit names to existing geographic lists.
Solution: This issue can be quickly identified and fixed using the geography cleaning tools. Note that you must have administrator permissions to use these tools.
Problem: Scientific names are misspelled, resulting in poor matching of taxonomic names to taxonomic databases.
Solution: This issue can be quickly identified and fixed using the taxonomic cleaning tools. Note that you must have administrator permissions to use these tools.
Problem: Species may be missing higher taxonomic information.
Solution: This is only a problem in Symbiota portals if you have scientific names that are not included in the taxonomic thesaurus. You can use the [taxonomic cleaning tools](https://biokic.github.io/symbiota-docs/coll_manager/data_cleaning/taxonomy/) to automatically import names from Catalog of Life or other resources into the thesaurus (your ability to do this depends on the portal), or contact your portal administrator about adding missing names to the thesaurus.
Problem: Data inconsistencies arise when incorrect character encodings are used during data manipulation or transfer. This issue occurs when datasets are opened, downloaded, or imported across different software platforms, leading to misinterpretation and garbled text. For instance, special characters like accents or symbols (e.g., the é in Carl Linné) may be rendered incorrectly, affecting the readability and accuracy of the data.
Solution: There are no cross-field search tools that would enable you to find mis-rendered symbols across all fields, but you can search certain fields. For example:
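Outside the portal, a downloaded file can be scanned across all fields at once. The sketch below uses one common heuristic: if a value can be re-encoded as Windows-1252 and then decoded as UTF-8 into a different string, it was probably UTF-8 text misread as Windows-1252. This catches only that one kind of encoding error, the function names are hypothetical, and any suggested repair should be reviewed before it is applied.

```python
def repaired_if_mojibake(text):
    """Return the repaired string if text looks like UTF-8 misread as
    Windows-1252, otherwise None."""
    try:
        candidate = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # the round trip failed, so this pattern does not apply
    return candidate if candidate != text else None


def suspect_fields(row):
    """Map field names to suggested repairs for values that look garbled."""
    suggestions = {}
    for name, value in row.items():
        repaired = repaired_if_mojibake(value or "")
        if repaired is not None:
            suggestions[name] = repaired
    return suggestions


# Illustrative row; recordedBy is an assumed column name.
row = {"recordedBy": "Carl LinnÃ©", "locality": "SÃO PAULO", "country": "Brazil"}
print(suspect_fields(row))
```

Correctly encoded text such as "SÃO PAULO" is not flagged, because its bytes do not form valid UTF-8 after the round trip.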
Problem: When transferring text files between Unix/Linux and DOS/Windows systems, line endings can become inconsistent. Unix/Linux systems typically use line feed (LF) characters, while DOS/Windows systems use carriage return (CR) and line feed (LF) combinations. This mismatch can result in extra characters appearing in the data, causing visual artifacts and processing errors.
Solution: This is unlikely to be a problem for data that has already been imported into a Symbiota portal. It is possible that erroneous (¶) symbols will be retained. In this case:
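For files handled outside the portal (for example, before an upload), line endings can be normalized and stray pilcrow symbols located with a short script. This is a minimal sketch with hypothetical function names and sample values. It reports fields containing ¶ rather than deleting the symbol, so each case can be reviewed.

```python
def normalize_line_endings(text):
    """Convert CRLF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def fields_with_pilcrow(row):
    """Names of fields whose value contains a pilcrow (U+00B6) character."""
    return [name for name, value in row.items() if value and "\u00b6" in value]


print(repr(normalize_line_endings("line one\r\nline two\rline three\n")))
print(fields_with_pilcrow({"locality": "Near creek\u00b6", "habitat": "forest"}))
```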
Problem: Some individualCount values may not be valid positive integers.
Solution:
Alternatively, you can just scan through the table where Individual Count IS NOT NULL and look for discrepancies by eye.
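If the table is too long to scan by eye, a downloaded copy can be filtered with a short script. The sketch below treats a value as valid only if it consists entirely of the digits 0-9 and is greater than zero. Blank values are ignored, mirroring the IS NOT NULL filter. The function names and sample rows are illustrative.

```python
import re


def is_positive_integer(text):
    """True for strings made only of the digits 0-9 whose value is greater than zero."""
    text = (text or "").strip()
    return bool(re.fullmatch(r"[0-9]+", text)) and int(text) > 0


def bad_individual_counts(rows):
    """Return rows with a non-blank individualCount that is not a positive integer."""
    return [
        row for row in rows
        if (row.get("individualCount") or "").strip()
        and not is_positive_integer(row["individualCount"])
    ]


rows = [
    {"catalogNumber": "D1", "individualCount": "12"},
    {"catalogNumber": "D2", "individualCount": "0"},
    {"catalogNumber": "D3", "individualCount": "3.5"},
    {"catalogNumber": "D4", "individualCount": "ca. 10"},
    {"catalogNumber": "D5", "individualCount": ""},
]
print([row["catalogNumber"] for row in bad_individual_counts(rows)])
```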
Problem: Values in the BasisOfRecord field do not match the recommended controlled vocabulary. While using standardized terms in this field is not strictly necessary, doing so does improve the discoverability and interoperability of your data.
The currently accepted values for BasisOfRecord include: MaterialEntity, PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation.
Note that even punctuation and capitalization differences in these values (e.g., Preserved Specimen) are discouraged.
Solution:
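For a downloaded dataset, one way to list values that fall outside the vocabulary is sketched below. The accepted set is copied from the list above, and the comparison is exact, so capitalization and spacing variants are reported. The column header basisOfRecord is an assumption about the export; the function name and sample rows are illustrative.

```python
from collections import Counter

ACCEPTED_BASIS_OF_RECORD = {
    "MaterialEntity", "PreservedSpecimen", "FossilSpecimen", "LivingSpecimen",
    "MaterialSample", "Event", "HumanObservation", "MachineObservation",
    "Taxon", "Occurrence", "MaterialCitation",
}


def nonstandard_basis_values(rows):
    """Count non-blank basisOfRecord values that are not exact vocabulary matches."""
    values = ((row.get("basisOfRecord") or "").strip() for row in rows)
    return Counter(v for v in values if v and v not in ACCEPTED_BASIS_OF_RECORD)


rows = [
    {"basisOfRecord": "PreservedSpecimen"},
    {"basisOfRecord": "Preserved Specimen"},
    {"basisOfRecord": "preservedspecimen"},
    {"basisOfRecord": "HumanObservation"},
    {"basisOfRecord": ""},
]
print(nonstandard_basis_values(rows))
```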
Katie Pearson. Data Quality Toolkit. In: Symbiota Support Hub (2024). Symbiota Documentation. https://biokic.github.io/symbiota-docs/editor/quality/. Created on 20 Mar 2024, last edited on 19 Jul 2024.