The myrwaR package includes a set of functions for retrieving data from the MyRWA Water Quality Access Database. First, we need to load the package:

Accessing the Database

Connecting to the Database

To connect to the database, use db_connect() and specify the path to the database, which returns a connection handler that is later used to retrieve the data.

base_dir <- 'D:/Dropbox/Work/mystic/db'
db_path <- file.path(base_dir, "MysticDB_20160208.accdb")
ch <- db_connect(path = db_path)

Retrieving Data from Any Table

The db_list_tables() function lists the names of tables in the database (note that dplyr has a function with the same name, so you may need to namespace it with myrwaR::):

myrwaR::db_list_tables(ch)

#>  [1] "20130622_VisitPrecip"        "20130925_TableauAnnualMeans"
#>  [3] "20140510_VisitPrecip"        "ActionLimit"                
#>  [5] "Catchment_DB"                "Characteristic"             
#>  [7] "CharacteristicNominal"       "IMPORT_LOCATIONS"           
#>  [9] "IMPORT_NOMINAL"              "IMPORT_NUMERIC"             
#> [11] "IMPORT_STORM_VISIT"          "IMPORT_VISIT"               
#> [13] "Location"                    "LocationType"               
#> [15] "Method"                      "Municipality"               
#> [17] "Project"                     "Record_DataImport"          
#> [19] "Reference"                   "Result"                     
#> [21] "ResultFlag"                  "ResultNominal"              
#> [23] "SampleFraction"              "SampleType"                 
#> [25] "SSO"                         "StormEvent"                 
#> [27] "StormEventVisit"             "StormFlag"                  
#> [29] "StormMethod"                 "Units"                      
#> [31] "Visit"                       "WaterBody"

To load the data from any one of these tables, use db_table(). For example, to get the Location table:

locations <- db_table(ch, table_name = "Location")
str(locations)

#> 'data.frame':    2544 obs. of  16 variables:
#>  $ ID                 : Factor w/ 2544 levels "1395MA","2ndRd",..: 1 2 3 4 5 6 7 8 9 10 ...
#>  $ LocationDescription: Factor w/ 372 levels ""," DS from Mass Ave, where brook crosses road, near Wilson Farms",..: NA NA NA 1 NA NA 37 NA 166 1 ...
#>  $ WaterBodyID        : Factor w/ 77 levels "Aberjona River",..: 46 25 46 1 1 1 1 1 31 1 ...
#>  $ MunicipalityID     : Factor w/ 22 levels "Arlington","Belmont",..: 1 17 1 20 20 20 20 20 20 20 ...
#>  $ Latitude           : num  42.4 42.5 42.4 42.4 42.4 ...
#>  $ Longitude          : num  -71.2 -71.1 -71.2 -71.1 -71.1 ...
#>  $ LocationTypeID     : int  22 22 22 22 22 22 22 22 20 22 ...
#>  $ LocationMethod     : Factor w/ 5 levels "GPS-Unspecified",..: 1 1 1 1 2 2 1 1 1 1 ...
#>  $ CoordinateSystem   : Factor w/ 2 levels "","WGS84": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ InWatershed        : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ RegionalID         : Factor w/ 2 levels "Lower Mystic River",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ WaterType          : Factor w/ 2 levels "Fresh","Saline": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Owner              : Factor w/ 11 levels "","Arlington",..: NA NA NA 1 NA NA NA NA NA 1 ...
#>  $ StaffGage          : int  0 0 0 0 0 0 1 0 0 0 ...
#>  $ County             : Factor w/ 2 levels "Middlesex","Suffolk": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ CatchmentArea_Sqmi : num  NA NA NA NA NA NA NA NA NA NA ...

Retrieving Water Quality Data

If a database connection has already been created, water quality data can be retrieved using db_results(). This function will automatically join the Visit and Result tables to provide the location IDs, datetimes, and useful attributes for each row in the Result table. Note that this function will not join the Location table, and thus not contain latitudes and longitudes. See the load_wq() function in the next section.

results <- db_results(ch)

#> Warning: closing unused RODBC handle 8

Water Quality Dataset Function

Finally, to connect to the database and retrieve the water quality data with the location information, use the load_wq() function (note that the name of this function differs from other database functions to maintain consistency with older R scripts used by MyRWA). The load_wq() function will connect to the database, and then merge the Result, Visit, and Location database tables into a single dataframe.

wq <- load_wq(path = db_path, sample_types = c("S"), exclude_flags = TRUE)
str(wq)

#> 'data.frame':    183279 obs. of  33 variables:
#>  $ LocationID         : Factor w/ 1013 levels "ABR001","ABR002",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ VisitID            : int  28280 26540 28280 28280 26540 28280 27527 26540 26540 26879 ...
#>  $ ID                 : int  140905 133640 140906 140901 133641 140902 137608 133638 133642 135375 ...
#>  $ CharacteristicID   : Factor w/ 68 levels "ALK","ARSENIC",..: 17 16 43 48 15 15 15 19 48 17 ...
#>  $ ResultValue        : num  448 57.57 717 717 5.38 ...
#>  $ Units              : Factor w/ 13 levels "%","CFU/100ml",..: 2 1 10 13 6 6 6 2 13 2 ...
#>  $ Qualifier          : Factor w/ 3 levels "","<",">": NA NA NA NA NA NA NA NA NA NA ...
#>  $ FlagID             : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
#>  $ MethodID           : int  3 24 69 71 24 24 24 17 71 2 ...
#>  $ ResultComment      : Factor w/ 48 levels ""," ","(bacteria) perservative (blue pill) fell out of one of bottles",..: NA NA NA NA NA NA NA NA NA NA ...
#>  $ SampleDepth        : num  NA NA NA NA NA NA NA NA NA NA ...
#>  $ UniqueID           : Factor w/ 25751 levels "197703211400_WQ77_22_DEP_S",..: 16404 7329 16404 16404 7329 16404 11131 7329 7329 8887 ...
#>  $ Datetime           : POSIXct, format: "2009-08-26 06:59:00" "2003-06-24 05:18:00" ...
#>  $ ProjectID          : Factor w/ 9 levels "BASE","BHWQM",..: 5 5 5 5 5 5 5 5 5 5 ...
#>  $ SampleTypeID       : Factor w/ 1 level "S": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ HasFlow            : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ SampleDepthType    : Factor w/ 2 levels "B","S": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ Comment            : Factor w/ 2086 levels ""," ","    Tributary",..: NA NA NA NA NA NA NA NA NA 674 ...
#>  $ LocationDescription: Factor w/ 302 levels ""," DS from Mass Ave, where brook crosses road, near Wilson Farms",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ WaterBodyID        : Factor w/ 63 levels "Aberjona River",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ MunicipalityID     : Factor w/ 22 levels "Arlington","Belmont",..: 20 20 20 20 20 20 20 20 20 20 ...
#>  $ Latitude           : num  42.4 42.4 42.4 42.4 42.4 ...
#>  $ Longitude          : num  -71.1 -71.1 -71.1 -71.1 -71.1 ...
#>  $ LocationTypeID     : int  22 22 22 22 22 22 22 22 22 22 ...
#>  $ LocationMethod     : Factor w/ 5 levels "GPS-Unspecified",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ CoordinateSystem   : Factor w/ 2 levels "","WGS84": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ InWatershed        : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ RegionalID         : Factor w/ 2 levels "Lower Mystic River",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ WaterType          : Factor w/ 2 levels "Fresh","Saline": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Owner              : Factor w/ 9 levels "","Arlington",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ StaffGage          : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ County             : Factor w/ 2 levels "Middlesex","Suffolk": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ CatchmentArea_Sqmi : num  NA NA NA NA NA NA NA NA NA NA ...

The sample_types parameter lets you specify which sample types to include in the data frame. By default, it will only return routine samples ("S"). To include other sample types such as field duplicates and field blanks, just add other SampleType IDs sample_types = c("S", "FD", "FB"), or set sample_types = NULL to retrieve all sample types. To see a complete list of sample types, load the SampleType table: db_table(ch, table_name="SampleType").

The exclude_flags parameter specifies whether the resulting data frame should exclude flagged samples (default is TRUE). To include the flagged samples, set this to FALSE.

Closing the Connection

To close a connection, simply use the close() function:

close(ch)

QAQC Procedures

Version Comparison

The compare_database() function summarizes changes between to database versions. This function loads the Result, Visit, and Location tables from each database version, and then summarizes how many rows have been added or removed from the first to the second version. Note that this does not look at changes to the values of each row, it only checks for row additions and removals.

compare_database(old_path = file.path(base_dir, "MysticDB_20160120.accdb"),
                 new_path = file.path(base_dir, "MysticDB_20160208.accdb"))

#> MyRWA WQ Database Comparison
#>   Old DB:  MysticDB_20160120.accdb 
#>   New DB:  MysticDB_20160208.accdb 
#> ================================================================================
#> Result Table
#> 
#> # Rows Added: 799 
#> Summary of Added Results:
#>        ID           CharacteristicID       Units    
#>  Min.   :446011   SPCOND    :100     mg/l     :375  
#>  1st Qu.:446211   DO        : 98     MPN/100ml:100  
#>  Median :446410   TEMP_WATER: 96     uS/cm    :100  
#>  Mean   :446410   DO_SAT    : 94     deg C    : 96  
#>  3rd Qu.:446610   NO23      : 77     %        : 94  
#>  Max.   :446809   TP        : 77     PSU      : 23  
#>                   (Other)   :257     (Other)  : 11  
#> 
#> # Rows Removed: 46 
#> Summary of Removed Results:
#>        ID         CharacteristicID       Units   
#>  Min.   :278223   DO     : 6       mg/l     :24  
#>  1st Qu.:278234   ENT    : 6       MPN/100ml: 6  
#>  Median :278246   NO23   : 6       uS/cm    : 6  
#>  Mean   :278246   SPCOND : 6       %        : 5  
#>  3rd Qu.:278257   TP     : 6       deg C    : 5  
#>  Max.   :278268   TSS    : 6       CFU/100ml: 0  
#>                   (Other):10       (Other)  : 0  
#> ================================================================================
#> Visit Table
#> 
#> # Rows Added: 102 
#> ProjectIDs with Added Visits: BASE HOTSPOT 
#> LocationIDs with Added Visits: ABR006 ABR028 ABR031 ABR036 ABR049 ALB006 ARL007 ARL010 ARL013 ARL014 ARL026 ARL027 ARL028 ARL136 BEI001 CHR95S MAR036 MEB001 MIB001 MIC004 MYR071 MYR275 MYRMMP SOMD05 SOMD07 SOMD08 SOMD09 UPL001 WIB001 WIN48WASH WINSWANGP WINWEDGE WINwse5 WINx03 WOB70-1 
#> Summary of Added Visits:
#>        ID            Datetime                     LocationID   ProjectID 
#>  Min.   :137296   Min.   :2015-01-23 06:20:00   MYR275 :14   BASE   :77  
#>  1st Qu.:137321   1st Qu.:2015-06-05 06:21:00   MYRMMP :14   HOTSPOT:25  
#>  Median :137347   Median :2015-10-02 07:03:30   BEI001 :13               
#>  Mean   :137348   Mean   :2015-08-26 10:04:27   CHR95S :13               
#>  3rd Qu.:137378   3rd Qu.:2015-12-02 10:28:45   MIC004 :12               
#>  Max.   :137403   Max.   :2015-12-16 07:23:00   ABR006 : 2               
#>                                                 (Other):34               
#> 
#> # Rows Removed: 6 
#> ProjectIDs with Removed Visits: BASE 
#> LocationIDs with Removed Visits: BEI093 CHR95S MIC004 MYR275 MYRMMP 
#> Summary of Removed Visits:
#>        ID           Datetime                    LocationID ProjectID
#>  Min.   :43393   Min.   :2015-01-23 06:20:00   BEI093:1    BASE:6   
#>  1st Qu.:43394   1st Qu.:2015-01-23 06:20:00   CHR95S:1             
#>  Median :43396   Median :2015-01-23 06:24:00   MIC004:1             
#>  Mean   :43396   Mean   :2015-01-23 06:25:10   MYR275:2             
#>  3rd Qu.:43397   3rd Qu.:2015-01-23 06:29:30   MYRMMP:1             
#>  Max.   :43398   Max.   :2015-01-23 06:33:00                        
#> ================================================================================
#> Location Table
#> 
#> # Rows Added: 0 
#> 
#> # Rows Removed: 0

The output of the compare_database() function can also be saves to a text file by setting the log_file argument.

compare_database(old_path = file.path(base_dir, "MysticDB_20150227.accdb"),
                 new_path = file.path(base_dir, "MysticDB_20150529.accdb"),
                 log_file = file.path(base_dir, 'compare_20150227_20150529.txt'))

Database QAQC Checks

The db_qaqc_suite() function runs a series of checks on the Result and Visit tables These checks are designed to find invalid values in each column of these tables. The types of checks include:

inconsitencies in lower/upper case
presence of empty strings, which should be replaced with NA’s (e.g. "")
presence of untrimmed strings that contain an extra space before or after the value (e.g. "FD ")
presence of NA in columns that should not have missing values
invalid date ranges (all dates should be between 1900-01-01 and the today’s date)
invalid relationships between tables (e.g. Result.Units->Units.ID, Result.FlagID->ResultFlag.ID, Visit.SampleTypeID->SampleType.ID, and Visit.ProjectID->Project.ID)

Note that the first three checks are included because Microsoft Access does not enforce consistency across these issues and thus may go unnoticed within the database. However, these issues can cause problems when importing the results into R, which does require these consistencies.

For each set of checks, db_qaqc_suite() function will report which columns failed, the reason, and the number of rows that failed each check. The specific rows that failed can be shown by setting print.values=TRUE (not shown here for brevity).

ch <- db_connect(db_path)
db_qaqc_suite(ch, print.rows=FALSE)

#> Loading tables...done
#> 
#> Checking Result table...
#> .. Checking for empty strings 
#> .... CharacteristicID ...OK
#> .... Units ...OK
#> .... Qualifier ...FAILED
#> 
#> ERROR: There are 600 row(s) with empty strings in column "Qualifier" 
#> 
#> .... FlagID ...OK
#> .. Checking for lowercase strings 
#> .... CharacteristicID ...OK
#> .... FlagID ...FAILED
#> 
#> ERROR: There are 2 row(s) with lowercase strings in column "FlagID" 
#> 
#> .. Checking for untrimmed strings 
#> .... CharacteristicID ...OK
#> .... Units ...OK
#> .... Qualifier ...OK
#> .... FlagID ...FAILED
#> 
#> ERROR: There are 34 row(s) with untrimmed strings in column "FlagID" 
#> 
#> .. Checking for invalid units 
#> .... Units ...OK
#> .. Checking for invalid flags 
#> .... FlagID ...FAILED
#> 
#> ERROR: There are 36 row(s) with invalid flags in column "FlagID" 
#> 
#> Checking Visit table...
#> .. Checking for empty strings 
#> .... UniqueID ...OK
#> .... LocationID ...OK
#> .... ProjectID ...OK
#> .... SampleTypeID ...OK
#> .... SampleDepthType ...OK
#> .. Checking for lowercase strings 
#> .... ProjectID ...OK
#> .... SampleTypeID ...OK
#> .... SampleDepthType ...OK
#> .. Checking for untrimmed strings 
#> .... UniqueID ...FAILED
#> 
#> ERROR: There are 12 row(s) with untrimmed strings in column "UniqueID" 
#> 
#> .... LocationID ...OK
#> .... ProjectID ...OK
#> .... SampleTypeID ...FAILED
#> 
#> ERROR: There are 12 row(s) with untrimmed strings in column "SampleTypeID" 
#> 
#> .... SampleDepthType ...OK
#> .. Checking for NA values 
#> .... UniqueID ...OK
#> .... LocationID ...OK
#> .... ProjectID ...OK
#> .... SampleTypeID ...FAILED
#> 
#> ERROR: There are 9 row(s) with NA values in column "SampleTypeID" 
#> 
#> .... SampleDepthType ...FAILED
#> 
#> ERROR: There are 8774 row(s) with NA values in column "SampleDepthType" 
#> 
#> .. Checking for datetimes out-of-range 
#> .... Datetime ...OK
#> .. Checking for invalid sample type IDs 
#> .... SampleTypeID ...FAILED
#> 
#> ERROR: There are 12 row(s) with invalid sample type IDs in column "SampleTypeID" 
#> 
#> .. Checking for invalid project IDs 
#> .... ProjectID ...OK

#> Warning in db_qaqc_suite(ch, print.rows = FALSE): Database did not pass
#> validation checks, see details above

close(ch)

MyRWA Database Functions

Jeffrey D. Walker, PhD

2016-02-16