Term
|
Definition
Data Integrity is universally critical, because every part of the analytical process that follows depends on the data on which it is based. |
|
|
Term
|
Definition
Crime Mapping is the use of a GIS to perform spatial analysis of crime and police activity. |
|
|
Term
|
Definition
The Geocoding Score, often called the "hit rate," gauges the level of geocoding success. A score of 80-90% is often thought of as desirable for law enforcement. That is, 80-90% of the data are represented on the map, and only these data will be analyzed. |
|
|
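The hit rate described above is simple arithmetic: matched records divided by total records. A minimal sketch follows; the function name and the record counts are hypothetical and used only for illustration.

```python
def geocoding_hit_rate(matched: int, total: int) -> float:
    """Return the percentage of records that geocoded successfully."""
    if total <= 0:
        raise ValueError("total must be a positive number of records")
    if not 0 <= matched <= total:
        raise ValueError("matched must be between 0 and total")
    return 100.0 * matched / total


# Hypothetical batch: 1,000 incident records, 870 of which matched a map location.
rate = geocoding_hit_rate(870, 1000)
print(f"Hit rate: {rate:.1f}%")
```

Under these assumed counts, the remaining 130 records never reach the map, so any analysis built on the map silently excludes them.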
Term
Data Cleaning
(or Data Scrubbing) |
|
Definition
Data Cleaning is the process of correcting data integrity errors: taking tabular data and correcting its mistakes before the data are used for analysis. |
|
|
Term
|
Definition
Origination Errors occur when the data is collected or transcribed. |
|
|
Term
|
Definition
Management Errors occur when the data is stored. |
|
|
Term
|
Definition
Retrieval Errors occur when we query or download data to analyze it. |
|
|
Term
|
Definition
"Empties" are an Origination Error that occurs when an entire data field is null, or empty. |
|
|
Term
Typographical Errors
(or Typos) |
|
Definition
Typographical Errors are Origination Errors that occur when keystrokes aren't what the operator intends. |
|
|
Term
|
Definition
Punctuation is an Origination Error that occurs when punctuation marks are used where they should not be, or omitted where they are expected. |
|
|
Term
|
Definition
Abbreviations is an Origination Error that occurs when efforts are made to shorten words, thereby decreasing the time and effort necessary to enter them into a data table. |
|
|
Term
|
Definition
Omissions is an Origination Error that occurs when necessary data elements are erased or, more likely, not entered in the first place. |
|
|
Term
|
Definition
Alias Errors are Origination Errors that occur from the use of names and phrases that make perfect sense to the human being, but are useless to a computer. |
|
|
Term
Malapropisms
(or Malaprops or Mals) |
|
Definition
Malapropisms are Origination Errors that occur when the data is not in the expected format or does not contain the expected information. For example, instead of "123 E Main St," the writer might enter "behind the fence" or "100 yards S/B." |
|
|
Term
|
Definition
Generalizations are Origination Errors that occur when a common place name or broad location is entered rather than a specific location. A common example is the use of "hundred blocks." |
|
|
Term
|
Definition
Invalid Entries are Origination Errors that look like correct data, but are not valid. The most common type of invalid entry is the non-address. |
|
|
Term
|
Definition
Extraneous is an Origination Error that occurs when extra data is entered into a field that should contain only limited data. This often happens in address fields and especially in CAD records. |
|
|
Term
|
Definition
Management errors arise from how we store our data - usually in the form of computer data files, but not necessarily. |
|
|
Term
|
Definition
Record Truncations are Management Errors that occur when a database or file system can't hold all the data that's been put into it, resulting in some records being deleted or not accepted. |
|
|
Term
|
Definition
Field Truncations are Management Errors that occur when specific fields aren't long enough or detailed enough to hold the information that's placed in them. The most common offenders are simple text fields. |
|
|
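A field truncation like the one described above can be illustrated with a fixed-width text field. This is only a sketch: the 20-character width, the sample address, and both function names are assumptions made for the example.

```python
FIELD_WIDTH = 20  # assumed width of a hypothetical address field


def store_in_field(value: str, width: int = FIELD_WIDTH) -> str:
    """Mimic a fixed-width text field that silently drops extra characters."""
    return value[:width]


def possibly_truncated(stored: str, width: int = FIELD_WIDTH) -> bool:
    """Flag stored values that fill the whole field and may have been cut off."""
    return len(stored) >= width


full_address = "1200 N MARTIN LUTHER KING JR BLVD"
stored = store_in_field(full_address)
print(stored)
print(possibly_truncated(stored))
```

A value that exactly fills its field is not proof of truncation, but it is a reasonable candidate for manual review.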
Term
|
Definition
Field Conversions are Management Errors that happen when data is changed from one type into another type. This usually occurs when we transfer data from one system into another electronically using some kind of automation. For example, a date field might inadvertently be converted to a numeric field, causing unpredictable changes in the resulting values. |
|
|
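The date-to-number conversion mentioned above can be sketched as follows. The example assumes the receiving system stores dates as a count of days from a fixed epoch; the epoch chosen here is an assumption for illustration, and a real system may use a different one.

```python
from datetime import date, timedelta

# Assumed epoch for the hypothetical target system.
ASSUMED_EPOCH = date(1899, 12, 30)


def date_to_serial(d: date) -> int:
    """Convert a date to a day count, as an automated transfer might do."""
    return (d - ASSUMED_EPOCH).days


def serial_to_date(serial: int) -> date:
    """Recover the date, which is possible only if the epoch is known."""
    return ASSUMED_EPOCH + timedelta(days=serial)


incident_date = date(2004, 3, 7)
serial = date_to_serial(incident_date)
print(serial)                  # a bare integer that no longer looks like a date
print(serial_to_date(serial))  # round trip back to the original date
```

Without knowing the epoch, an analyst looking at the integer alone cannot tell which date it represents.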
Term
|
Definition
Physical is a Management Error that is the result of data corruption. This occurs when the physical record containing the data is damaged or misplaced. |
|
|
Term
|
Definition
Retrieval Errors may occur when the analyst retrieves information; even though the source data is accurate and reliable, the resulting search, query or reporting functions can often lead to problems. In general, retrieval errors are identical to management errors. |
|
|
Term
|
Definition
A typical data "chain of custody" looks something like this:
- Victim
- Officer
- Records Clerk
- Database Administrator
- Crime Analyst
|
|
|
Term
|
Definition
The final way to overcome data integrity problems is to compensate for them. This method is by far the weakest and least desirable way to cope with data errors, yet it is commonly used by crime analysts. |
|
|
Term
|
Definition
Probably the most common example of compensating for dirty data is the use of Alias Tables in GIS software. An alias table converts a known erroneous address into a valid, matchable address. The data is still dirty and the damage is not repaired. Alias tables serve only as a band-aid. |
|
|
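The alias table idea above amounts to a lookup from a known bad entry to a matchable address. A minimal sketch, not tied to any particular GIS product; the table contents and the function name are hypothetical.

```python
from typing import Mapping

# Hypothetical alias table: known erroneous entry -> valid, matchable address.
ALIAS_TABLE = {
    "CITY HALL": "100 N MAIN ST",
    "HIGH SCHOOL": "2500 SCHOOL RD",
}


def resolve_address(raw: str, aliases: Mapping[str, str]) -> str:
    """Return the aliased address if the entry is known, else the entry as given."""
    key = raw.strip().upper()
    return aliases.get(key, raw)


print(resolve_address(" city hall ", ALIAS_TABLE))
print(resolve_address("123 E MAIN ST", ALIAS_TABLE))
```

The lookup happens only at matching time. The stored record still says "city hall", which is why the approach compensates for the error rather than repairing it.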
Term
|
Definition
The two elements of a data cleaning operation are the "fault," the error we're searching for, and the "fix," what we replace it with. |
|
|
Term
|
Definition
Manual Data Cleaning consists of a human operator searching through records, spotting instances of errors and replacing the faults with valid fixes. |
|
|
Term
Semi-Automatic Data Cleaning |
|
Definition
Semi-Automatic Data Cleaning consists of the human operator using an automated function to quickly perform an individual operation. |
|
|
Term
Fully Automatic Data Cleaning |
|
Definition
Fully Automatic Data Cleaning is the strongest approach for most users. This method consists of preparing a list of fault/fix operations ahead of time, which are then followed in order by a completely automated set of search and replace actions. |
|
|
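The prepared list of fault/fix operations described above can be sketched as an ordered list of pairs applied one after another. The specific rules, the use of regular expressions, and the function names are assumptions for illustration, not a prescribed rule set.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical fault/fix list, applied top to bottom.
FAULT_FIX_RULES: List[Tuple[str, str]] = [
    (r"[.,]", ""),           # fault: stray punctuation   fix: remove it
    (r"\s+", " "),           # fault: repeated spaces     fix: single space
    (r"\bSTREET\b", "ST"),   # fault: spelled-out suffix  fix: standard form
    (r"\bAVENUE\b", "AVE"),
]


def clean_value(value: str, rules: Iterable[Tuple[str, str]]) -> str:
    """Apply every fault/fix pair, in order, to one value."""
    cleaned = value.upper().strip()
    for fault, fix in rules:
        cleaned = re.sub(fault, fix, cleaned)
    return cleaned


def clean_column(values: Iterable[str], rules: List[Tuple[str, str]]) -> List[str]:
    """Return a cleaned copy; the source values are left untouched."""
    return [clean_value(v, rules) for v in values]


source = ["123  e. main street", "45 Oak Avenue"]
print(clean_column(source, FAULT_FIX_RULES))
print(source)  # the original list is unchanged
```

Because the rules run in a fixed order, reordering the list can change the result, so the order is part of the cleaning design.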
Term
|
Definition
Cleaning operations can often have unintended consequences. These consequences occur when a given cleaning operation affects data other than that intended by the user (over-inclusion), or fails to clean all the targets intended by the user (under-inclusion). |
|
|
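Over-inclusion and under-inclusion, as described above, are easy to reproduce with a single search and replace. The sketch below uses hypothetical addresses; the word-boundary pattern is one possible safeguard, not the only one.

```python
import re


def naive_fix(address: str) -> str:
    """Plain substring replace: also hits 'AVE' inside other words."""
    return address.replace("AVE", "AVENUE")


def bounded_fix(address: str) -> str:
    """Replace 'AVE' only when it stands alone as a whole word."""
    return re.sub(r"\bAVE\b", "AVENUE", address)


print(naive_fix("123 HAVEN AVE"))    # over-inclusion: the street name HAVEN is damaged
print(bounded_fix("123 HAVEN AVE"))  # only the suffix is changed
print(bounded_fix("456 Oak Ave"))    # under-inclusion: mixed-case 'Ave' is missed
```

Tightening a rule to avoid over-inclusion tends to increase under-inclusion, so each operation is worth checking against a sample of real records before it is run in bulk.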
Term
|
Definition
- Never clean source data; always use a copy.
- Never replace numbers with other numbers.
- Don't fix a problem with another problem.
|
|
|
Term
Off-the-Shelf Automation
Data Cleaning |
|
Definition
Off-the-Shelf automation data cleaning products offer professionalism and customer support; however, they may require an initial outlay of money. |
|
|
Term
|
Definition
Homegrown data cleaning applications are programs written by someone at the local level. The support is local and the product will be tailor-made; however, the support is often weak and the product may be time-consuming to build. |
|
|
Term
|
Definition
Ad Hoc data cleaning applications are macro-type automations written at the local level, often by the crime analyst. Ad-hoc solutions usually employ macro automation technology such as VBA. |
|
|
Term
|
Definition
A well-written cleaning application, whether created internally or purchased from a vendor, should not have its cleaning parameters "hard-coded" in the application. It should be adaptable. Macros are inflexible and contain hard-coded operations. Applications are typically far more flexible. |
|
|
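One way to keep cleaning parameters out of the program, as recommended above, is to read the fault/fix list from an external table. The sketch below assumes a CSV layout with `fault` and `fix` columns; that layout, the sample rules, and the function names are hypothetical. An in-memory string stands in for the file so the example is self-contained.

```python
import csv
import io
import re
from typing import List, TextIO, Tuple

# Stand-in for an external rules file that an analyst could edit without
# touching the program (assumed layout: one fault/fix pair per row).
RULES_CSV = r"""fault,fix
\bSTREET\b,ST
\bAVENUE\b,AVE
"""


def load_rules(source: TextIO) -> List[Tuple[str, str]]:
    """Read fault/fix pairs from a CSV source with 'fault' and 'fix' columns."""
    reader = csv.DictReader(source)
    return [(row["fault"], row["fix"]) for row in reader]


def apply_rules(value: str, rules: List[Tuple[str, str]]) -> str:
    """Apply each loaded fault/fix pair, in file order, to one value."""
    for fault, fix in rules:
        value = re.sub(fault, fix, value)
    return value


rules = load_rules(io.StringIO(RULES_CSV))
print(apply_rules("123 E MAIN STREET", rules))
```

With this arrangement, adding or changing a rule means editing the table rather than rewriting and retesting the code.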