Reinhard Simon, International Potato Center, Lima, Peru
The datacheck package provides some simple functions to check the consistency of a dataset. It assumes data are available in tabular format, typically a CSV file with objects or records in rows and attributes or variables in columns.
In a database setting the variables would be controlled by the database, at least for conformance to types (character, numeric, etc.) and allowed minimum and maximum values. However, data are often gathered in simple spreadsheets or otherwise lack such constraints. Here, data constraints such as allowed types or values, expected values, and relationships can be defined using R commands and syntax. This allows much more flexibility and fine-grained control, but it typically also demands a lot of domain knowledge from the user. It is therefore often useful to re-use such domain-aware rule files across tables with similar content. To support this reuse, the tool is forgiving: a rule that cannot be executed because its variable is not present in the table being analyzed is reported as failed rather than stopping the analysis.
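To make the idea of rules written in R syntax concrete, here is a minimal base-R sketch. It does not use the package's own functions: the table `soil`, the rule strings, and the helper `check_rule` are all made up for illustration, and the evaluation approach is an assumption about how such rules could be applied.

```r
# Hypothetical mini table standing in for a soil-sample CSV file.
soil <- data.frame(
  ID   = 1:5,
  pH   = c(6.5, 7.1, -200, 5.2, 8.0),
  Sand = c(40, 55, 60, 101, 30)
)

# Rules are plain R expressions stored as text, one per check.
rules <- c("!duplicated(ID)", "pH >= 0 & pH <= 14", "Sand >= 0 & Sand <= 100")

# Evaluate one rule with the table's columns in scope and
# return the row numbers that violate it.
check_rule <- function(rule, data) {
  ok <- eval(parse(text = rule), envir = data)
  which(!ok)
}

violations <- lapply(rules, check_rule, data = soil)
names(violations) <- rules
print(violations)
```

Each rule yields one logical value per record, so the violating row numbers fall out of a single `which()` call.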
Use the following commands to copy some example files to your current working directory (uncomment the file.copy command):
atable = system.file("examples/soilsamples.csv", package = "datacheck")
srules = system.file("examples/soil_rules.R", package = "datacheck")
# Uncomment the next two lines
# file.copy(atable, 'soilsamples.csv')
# file.copy(srules, 'soil_rules.R')
Then type the command runDatacheck() in the R editor.
Use the upload buttons to load the respective files from your working directory, then review the results.
Assuming you have copied the above-mentioned files to your working directory, proceed to read in the data.
atable = read.csv(atable, header = TRUE, stringsAsFactors = FALSE)
srules = read.rules(srules)
profil = datadict.profile(atable, srules)
You can inspect a graphical summary of rules per variable:
ruleCoverage(profil)
 
You can also plot the cumulative number of records with increasing scores:
scoreSum(profil)
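As a rough illustration of what a cumulative count by score means, the following base-R sketch accumulates the number of records per score from low to high. The scores are invented, and this is not the package's implementation.

```r
# Hypothetical per-record scores (number of rules each record passed).
record_score <- c(35, 35, 31, 35, 33, 31, 35, 35)

# Count records per distinct score, then accumulate from low to high.
counts <- table(record_score)
cumulative <- cumsum(counts)
print(cumulative)
```

Records that pass few rules appear at the low end of the scale, so a steep early rise would point to many low-scoring records.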
 
Or see the tables (only the first 20 records and first 6 columns shown):
xtable(atable[1:20, 1:6])
| | ID | Latitude | Longitude | Country | Adm1 | Adm2 |
|---|---|---|---|---|---|---|
| 1 | 1 | -7.48 | -78.97 | Peru | Cajamarca | Contumazá | 
| 2 | 2 | -7.48 | -78.97 | Peru | Cajamarca | Contumazá | 
| 3 | 3 | -7.48 | -78.97 | Peru | Cajamarca | Contumazá | 
| 4 | 4 | -18.18 | -70.47 | Peru | Tacna | Tacna | 
| 5 | 5 | -12.26 | -75.07 | Peru | Huancavelica | Tayacaja | 
| 6 | 6 | -12.26 | -75.07 | Peru | Huancavelica | Tayacaja | 
| 7 | 7 | -12.24 | -75.05 | Peru | Huancavelica | Tayacaja | 
| 8 | 8 | -12.24 | -75.05 | Peru | Huancavelica | Tayacaja | 
| 9 | 9 | -12.08 | -76.95 | Peru | Lima | Lima | 
| 10 | 10 | -12.08 | -76.95 | Peru | Lima | Lima | 
| 11 | 11 | -12.03 | -75.24 | Peru | Junin | Huancayo | 
| 12 | 12 | -11.13 | -75.36 | Peru | Junin | Chanchamayo | 
| 13 | 13 | -10.58 | -75.40 | Peru | Pasco | Oxapampa | 
| 14 | 14 | -9.10 | -76.59 | Peru | Huanuco | Huacaybamba | 
| 15 | 15 | -5.89 | -76.11 | Peru | Loreto | Alto Amazonas | 
| 16 | 16 | -3.80 | -73.32 | Peru | Loreto | Maynas | 
| 17 | 17 | -3.80 | -73.32 | Peru | Loreto | Maynas | 
| 18 | 18 | -3.80 | -73.32 | Peru | Loreto | Maynas | 
| 19 | 19 | -3.80 | -73.32 | Peru | Loreto | Maynas | 
| 20 | 20 | -3.80 | -73.32 | Peru | Loreto | Maynas | 
The score table can be displayed in the same way. In addition to the scores, it contains the total scores by record and by variable, as well as the maximum score by variable.
ps = profil$scores
recs = c(1:10, nrow(ps) - 1, nrow(ps))
cols = c(1:4, ncol(ps))
xtable(ps[recs, cols])
| | ID | Latitude | Longitude | Country | Record.score |
|---|---|---|---|---|---|
| 1 | 3.00 | 2.00 | 3.00 | 2.00 | 35.00 | 
| 2 | 3.00 | 2.00 | 3.00 | 2.00 | 35.00 | 
| 3 | 3.00 | 2.00 | 3.00 | 2.00 | 35.00 | 
| 4 | 3.00 | 2.00 | 3.00 | 2.00 | 35.00 | 
| 5 | 3.00 | 2.00 | 3.00 | 2.00 | 35.00 | 
| 6 | 3.00 | 2.00 | 3.00 | 2.00 | 35.00 | 
| 7 | 3.00 | 2.00 | 3.00 | 2.00 | 35.00 | 
| 8 | 3.00 | 2.00 | 3.00 | 2.00 | 35.00 | 
| 9 | 3.00 | 2.00 | 3.00 | 2.00 | 35.00 | 
| 10 | 3.00 | 2.00 | 3.00 | 2.00 | 31.00 | 
| Attribute.score | 5259.00 | 3490.00 | 5243.00 | 3506.00 | 61055.00 | 
| Rules.per.variable | 3.00 | 2.00 | 3.00 | 2.00 | 35.00 | 
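The layout of the score table can be sketched in base R under the assumption, inferred from the figures above, that a score is the number of rules a record passes for a variable. The pass/fail matrix and the rule-to-variable mapping below are hypothetical, and treating a missing result as not passed is likewise an assumption.

```r
# Hypothetical pass/fail results: one row per record, one column per rule.
passed <- matrix(
  c(TRUE, TRUE, TRUE,
    TRUE, TRUE, FALSE,
    TRUE, NA,   TRUE),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("ID_type", "ID_unique", "pH_range"))
)
rule_variable <- c("ID", "ID", "pH")  # variable each rule belongs to
variables <- unique(rule_variable)

# Score per record and variable = number of that variable's rules passed
# (a missing result is counted as not passed).
scores <- sapply(variables, function(v) {
  rowSums(passed[, rule_variable == v, drop = FALSE], na.rm = TRUE)
})
scores <- cbind(scores, Record.score = rowSums(scores))

# Append the column totals and the number of rules per variable.
n_rules <- sapply(variables, function(v) sum(rule_variable == v))
scores <- rbind(scores,
                Attribute.score = colSums(scores),
                Rules.per.variable = c(n_rules, sum(n_rules)))
print(scores)
```

A record whose score for a variable is below the value in the last row has failed at least one rule for that variable.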
A last visualization is a heatmap of the score table, which groups similar records and similar rule profiles to help detect any patterns.
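The package's own plotting call is not shown here. As a rough stand-in, base R's `heatmap()` can cluster a small score matrix in the same spirit. The matrix below is invented, and the choice of `heatmap()` is an assumption rather than the package's method.

```r
# Hypothetical score matrix: records in rows, variables in columns.
scores <- matrix(
  c(3, 2, 2,
    3, 2, 2,
    3, 0, 1,
    2, 2, 2,
    3, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("rec", 1:5), c("ID", "Latitude", "pH"))
)

pdf(NULL)  # null device so no file is written; omit this line interactively
hm <- heatmap(scores, scale = "none")
dev.off()

# Row order chosen by the clustering: identical profiles end up adjacent.
print(rownames(scores)[hm$rowInd])
```

Records with the same pattern of failures sit next to each other in the reordered rows, which is what makes systematic problems easy to spot.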
 
For comparison, we deliberately introduce a few errors into our table, as shown below. We also exclude a rule on soil types for better display.
atable$P[1] = -100
atable$pH[11] = -200
srule1 = srules[-c(33), ]
profil = datadict.profile(atable, srule1)
To get a better handle on the data it is always informative to review simple descriptive summaries of the data. A custom summary function is included in the package to display this summary in tabular form:
xtable(shortSummary(atable))
| | n | missing | unique | value | min | max | Mean | sd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 1753 | 0 | 1753 | | 1 | 1753 | 877 | 506.19 | 88.6 | 176.2 | 439.0 | 877.0 | 1315.0 | 1577.8 | 1665.4 |
| Latitude | 1737 | 16 | 168 | | -18.2976 | -3.6159 | -12.28 | 3.23 | -18.182 | -15.884 | -15.833 | -12.070 | -11.130 | -7.157 | -5.894 |
| Longitude | 1737 | 16 | 169 | | -80.823 | -69.0654 | -74.6 | 2.71 | -77.61 | -76.95 | -76.95 | -75.35 | -72.10 | -70.03 | -70.03 |
| Country | 1753 | 0 | 1 | Peru | | | | | | | | | | | |
| Adm1 | 1753 | 0 | 22 | | | | | | | | | | | | |
| Adm2 | 1743 | 10 | 58 | | | | | | | | | | | | |
| Adm3 | 1738 | 15 | 110 | | | | | | | | | | | | |
| pH | 1751 | 2 | 333 | | -200 | 10.46 | 6.361 | 5.12 | 4.100 | 4.510 | 5.200 | 7.100 | 7.600 | 7.900 | 8.185 |
| Conductivity | 1752 | 1 | 443 | | 0.02 | 42.4 | 1.571 | 2.51 | 0.08 | 0.11 | 0.21 | 0.52 | 2.34 | 3.95 | 5.25 |
| CaCO3 | 1752 | 1 | 190 | | 0 | 94.8 | 1.722 | 6.69 | 0.000 | 0.000 | 0.000 | 0.000 | 0.380 | 3.393 | 10.100 |
| Organic_matter | 1752 | 1 | 370 | | 0.03 | 50.9 | 2.336 | 3.71 | 0.290 | 0.500 | 0.890 | 1.500 | 2.400 | 4.210 | 6.918 |
| P | 1752 | 1 | 453 | | -100 | 503.7 | 19.36 | 25.57 | 2.80 | 3.70 | 6.00 | 13.20 | 22.30 | 46.48 | 58.90 |
| Sand | 1687 | 66 | 73 | | 0 | 100 | 54.97 | 16.29 | 26.0 | 33.2 | 46.0 | 54.0 | 66.0 | 76.0 | 82.0 |
| Lime | 1686 | 67 | 58 | | 0 | 74 | 29.01 | 9.82 | 12 | 18 | 24 | 28 | 35 | 40 | 44 |
| Clay | 1686 | 67 | 54 | | 0 | 76 | 16.01 | 10.25 | 2 | 4 | 8 | 16 | 20 | 28 | 36 |
| Soil_texture | 1692 | 61 | 12 | | | | | | | | | | | | |
| Altitude | 1753 | 0 | 157 | | -9999 | 4417 | 1661 | 1940.9 | 78 | 78 | 235 | 839 | 3299 | 3846 | 3846 |
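For readers who want to see how such a summary might be assembled, here is a minimal base-R sketch. The helper `summarise_column` and the two-column table are hypothetical, only three quantiles are computed for brevity, and the package's own summary function may define its columns differently (for example, whether missing values count toward `unique`).

```r
# Summarise one column: counts first, then numeric statistics where applicable.
summarise_column <- function(x) {
  out <- c(n = sum(!is.na(x)),
           missing = sum(is.na(x)),
           unique = length(unique(x[!is.na(x)])))
  if (is.numeric(x)) {
    out <- c(out,
             min = min(x, na.rm = TRUE), max = max(x, na.rm = TRUE),
             mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE),
             quantile(x, c(0.05, 0.5, 0.95), na.rm = TRUE))
  }
  out
}

# Hypothetical table with one numeric and one character column.
soil <- data.frame(
  pH = c(6.5, 7.1, NA, 5.2, 8.0),
  Country = c("Peru", "Peru", "Peru", "Peru", "Peru"),
  stringsAsFactors = FALSE
)
print(lapply(soil, summarise_column))
```

Extreme minima or maxima in such a summary, like the negative pH and P values introduced above, are often the quickest hint that a range rule is worth adding.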
A summary of the results by rule can be seen from the profil object:
xtable(profil$checks)
| | Variable | Type | Rule | Comment | Execution | Error.sum | Error.list |
|---|---|---|---|---|---|---|---|
| 1 | ID | integer | sapply(ID, is.integer) | None | ok | 0 | none | 
| 2 | ID | integer | !duplicated(ID) | None | ok | 0 | none | 
| 3 | ID | integer | ID > 0 & ID < 1754 | None | ok | 0 | none | 
| 4 | Latitude | numeric | sapply(Latitude, is.numeric) | None | ok | 0 | none | 
| 5 | Latitude | numeric | Latitude < 0 | None | ok | 0 | none | 
| 6 | Longitude | numeric | sapply(Longitude, is.numeric) | None | ok | 0 | none | 
| 7 | Longitude | numeric | Longitude < 180 & Longitude > -180 | None | ok | 0 | none | 
| 8 | Longitude | numeric | is.null(Longitude) == is.null(Latitude) | None | ok | 0 | none | 
| 9 | Adm1 | character | sapply(Adm1, is.character) | None | ok | 0 | none | 
| 10 | Adm2 | character | sapply(Adm2, is.character) | None | ok | 0 | none | 
| 11 | Adm3 | character | sapply(Adm3, is.character) | None | ok | 0 | none | 
| 12 | Country | character | sapply(Country, is.character) | None | ok | 0 | none | 
| 13 | Altitude | integer | sapply(Altitude, is.integer) | None | ok | 0 | none | 
| 14 | Adm1 | character | is.null(Adm1) == is.null(Longitude) | | ok | 0 | none |
| 15 | Adm2 | character | is.null(Adm2) == is.null(Longitude) | None | ok | 0 | none | 
| 16 | Adm3 | character | is.null(Adm3) == is.null(Longitude) | None | ok | 0 | none | 
| 17 | Country | character | is.null(Country) == is.null(Longitude) | None | ok | 0 | none | 
| 18 | Altitude | integer | is.null(Altitude) == is.null(Longitude) | None | ok | 0 | none | 
| 19 | pH | numeric | sapply(pH, is.numeric) | None | ok | 0 | none | 
| 20 | pH | numeric | pH >= 0 | pH bigger than | ok | 1 | 11 |
| 21 | pH | numeric | pH <= 14 | pH lesser than | ok | 0 | none |
| 22 | Conductivity | numeric | sapply(Conductivity, is.numeric) | None | ok | 0 | none |
| 23 | Conductivity | numeric | Conductivity >= 0 | None | ok | 0 | none |
| 24 | CaCO3 | numeric | sapply(CaCO3, is.numeric) | None | ok | 0 | none |
| 25 | CaCO3 | numeric | CaCO3 >= 0 | None | ok | 0 | none |
| 26 | Sand | numeric | sapply(Sand, is.numeric) | None | ok | 0 | none | 
| 27 | Sand | numeric | sapply(Sand, is.withinRange, 0, 100) | None | ok | 0 | none | 
| 28 | Lime | numeric | sapply(Lime, is.numeric) | None | ok | 0 | none | 
| 29 | Lime | numeric | sapply(Lime, is.withinRange, 0, 100) | None | ok | 0 | none | 
| 30 | Clay | numeric | sapply(Clay, is.numeric) | None | ok | 0 | none | 
| 31 | Clay | numeric | sapply(Clay, is.withinRange, 0, 100) | None | ok | 0 | none | 
| 32 | Soil_texture | character | sapply(Soil_texture, is.character) | None | ok | 0 | none | 
| 34 | P | numeric | sapply(P, is.numeric) | None | ok | 0 | none | 
| 35 | P | numeric | P >= 0 | None | ok | 1 | 1 |
The checks part lists all erroneous records for each rule in the last column. This list may be too long for printing. For this reason, a custom report function (prep4rep) displays only the first n records, where n = 5 is the default.
atable$Sand[20:30] = -1
profil = datadict.profile(atable, srule1)
xtable(prep4rep(profil$checks))
| | Variable | Type | Rule | Comment | Execution | Error.sum | Error.list |
|---|---|---|---|---|---|---|---|
| 1 | ID | integer | sapply(ID, is.integer) | None | ok | 0 | none | 
| 2 | ID | integer | !duplicated(ID) | None | ok | 0 | none | 
| 3 | ID | integer | ID > 0 & ID < 1754 | None | ok | 0 | none | 
| 4 | Latitude | numeric | sapply(Latitude, is.numeric) | None | ok | 0 | none | 
| 5 | Latitude | numeric | Latitude < 0 | None | ok | 0 | none | 
| 6 | Longitude | numeric | sapply(Longitude, is.numeric) | None | ok | 0 | none | 
| 7 | Longitude | numeric | Longitude < 180 & Longitude > -180 | None | ok | 0 | none | 
| 8 | Longitude | numeric | is.null(Longitude) == is.null(Latitude) | None | ok | 0 | none | 
| 9 | Adm1 | character | sapply(Adm1, is.character) | None | ok | 0 | none | 
| 10 | Adm2 | character | sapply(Adm2, is.character) | None | ok | 0 | none | 
| 11 | Adm3 | character | sapply(Adm3, is.character) | None | ok | 0 | none | 
| 12 | Country | character | sapply(Country, is.character) | None | ok | 0 | none | 
| 13 | Altitude | integer | sapply(Altitude, is.integer) | None | ok | 0 | none | 
| 14 | Adm1 | character | is.null(Adm1) == is.null(Longitude) | | ok | 0 | none |
| 15 | Adm2 | character | is.null(Adm2) == is.null(Longitude) | None | ok | 0 | none | 
| 16 | Adm3 | character | is.null(Adm3) == is.null(Longitude) | None | ok | 0 | none | 
| 17 | Country | character | is.null(Country) == is.null(Longitude) | None | ok | 0 | none | 
| 18 | Altitude | integer | is.null(Altitude) == is.null(Longitude) | None | ok | 0 | none | 
| 19 | pH | numeric | sapply(pH, is.numeric) | None | ok | 0 | none | 
| 20 | pH | numeric | pH >= 0 | pH bigger than | ok | 1 | 11 |
| 21 | pH | numeric | pH <= 14 | pH lesser than | ok | 0 | none |
| 22 | Conductivity | numeric | sapply(Conductivity, is.numeric) | None | ok | 0 | none |
| 23 | Conductivity | numeric | Conductivity >= 0 | None | ok | 0 | none |
| 24 | CaCO3 | numeric | sapply(CaCO3, is.numeric) | None | ok | 0 | none |
| 25 | CaCO3 | numeric | CaCO3 >= 0 | None | ok | 0 | none |
| 26 | Sand | numeric | sapply(Sand, is.numeric) | None | ok | 0 | none | 
| 27 | Sand | numeric | sapply(Sand, is.withinRange, 0, 100) | None | ok | 11 | 20,21,22,23,24 … more | 
| 28 | Lime | numeric | sapply(Lime, is.numeric) | None | ok | 0 | none | 
| 29 | Lime | numeric | sapply(Lime, is.withinRange, 0, 100) | None | ok | 0 | none | 
| 30 | Clay | numeric | sapply(Clay, is.numeric) | None | ok | 0 | none | 
| 31 | Clay | numeric | sapply(Clay, is.withinRange, 0, 100) | None | ok | 0 | none | 
| 32 | Soil_texture | character | sapply(Soil_texture, is.character) | None | ok | 0 | none | 
| 34 | P | numeric | sapply(P, is.numeric) | None | ok | 0 | none | 
| 35 | P | numeric | P >= 0 | None | ok | 1 | 1 |
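The shortened Error.list entry for the Sand rule above can be imitated with a few lines of base R. The helper name `shorten_errors` is made up, and this is only a sketch of the truncation idea, not the package's code.

```r
# Collapse row numbers into a short label, keeping at most n of them.
# "\u2026" is the ellipsis character seen in the report table.
shorten_errors <- function(rows, n = 5) {
  if (length(rows) == 0) return("none")
  label <- paste(head(rows, n), collapse = ",")
  if (length(rows) > n) label <- paste(label, "\u2026 more")
  label
}

print(shorten_errors(20:30))       # eleven failing records, only five listed
print(shorten_errors(11))          # a single failing record
print(shorten_errors(integer(0)))  # no failing records
```

The full list of failing records stays available in the unshortened checks table, so nothing is lost by trimming the printed version.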
A rule may fail to execute if its syntax is wrong. Another reason, particularly when re-using rule files across tables, may be that a particular variable name is not present among the column names of the current table. The tool will just ignore the rule and report a 'failed' execution. Let us simply modify an existing rule as below:
srule1$Variable[25] = "caCO3"
srule1$Rule[25] = "caCO3 >= 0"
profil = datadict.profile(atable, srule1)
Now let us just look at an excerpt of the results table:
xtable(prep4rep(profil$checks[20:30, ]))
| | Variable | Type | Rule | Comment | Execution | Error.sum | Error.list |
|---|---|---|---|---|---|---|---|
| 20 | pH | numeric | pH >= 0 | pH bigger than | ok | 1 | 11 |
| 21 | pH | numeric | pH <= 14 | pH lesser than | ok | 0 | none |
| 22 | Conductivity | numeric | sapply(Conductivity, is.numeric) | None | ok | 0 | none |
| 23 | Conductivity | numeric | Conductivity >= 0 | None | ok | 0 | none |
| 24 | CaCO3 | numeric | sapply(CaCO3, is.numeric) | None | ok | 0 | none |
| 25 | caCO3 | numeric | caCO3 >= 0 | None | failed | 0 | NA |
| 26 | Sand | numeric | sapply(Sand, is.numeric) | None | ok | 0 | none | 
| 27 | Sand | numeric | sapply(Sand, is.withinRange, 0, 100) | None | ok | 11 | 20,21,22,23,24 … more | 
| 28 | Lime | numeric | sapply(Lime, is.numeric) | None | ok | 0 | none | 
| 29 | Lime | numeric | sapply(Lime, is.withinRange, 0, 100) | None | ok | 0 | none | 
| 30 | Clay | numeric | sapply(Clay, is.numeric) | None | ok | 0 | none | 
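The 'failed' status in row 25 can be reproduced in spirit with `tryCatch()`. The sketch below is a guess at the general mechanism, not the package's implementation, and the helper `run_rule` and the small table are hypothetical.

```r
# Hypothetical table with a correctly spelled CaCO3 column.
soil <- data.frame(CaCO3 = c(0, 0.38, 10.1), pH = c(6.5, 7.1, 5.2))

# Run one rule; report "failed" instead of stopping when it cannot be evaluated.
# enclos = baseenv() keeps stray objects in the workspace from masking a missing column.
run_rule <- function(rule, data) {
  tryCatch({
    ok <- eval(parse(text = rule), envir = data, enclos = baseenv())
    list(execution = "ok", error_sum = sum(!ok, na.rm = TRUE))
  }, error = function(e) {
    list(execution = "failed", error_sum = 0)
  })
}

str(run_rule("CaCO3 >= 0", soil))   # column exists
str(run_rule("caCO3 >= 0", soil))   # wrong case: no such column
str(run_rule("CaCO3 > = 0", soil))  # syntax error
```

Both a misspelled variable and malformed syntax end up in the same branch, which is why the Execution column alone does not tell the two causes apart.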
End of tutorial