Data Quality Control – A new blog series discussing the Why’s, What’s and How’s of Data Quality Assessment. First the Why:
A planetary body shakes and this signal is recorded for later analysis. But getting from ground motion to its digital representation is fraught with potential issues all along the way: environmental influences, vault design, the recording instrument, its digitizer, communications, meta-data veracity, data assemblage, distribution mechanisms, and more. Each of these contributes to the overall quality of the final product. Just how accurately does the data represent what really occurred back on the ground?
With so many parties involved in producing the final data-set, we must verify that the data is, in fact, usable for analysis; when the data is suspect in any way, so is any analysis of it. In this series, I will be looking at what this all means: the ways in which data quality can be accurately assessed, using examples of both good and bad data.
For our first illustration, a look at some data recorded on the moon back in ’70 and ’71. After downloading data from Apollo Mission 12, I analyzed it using the PSD PDF method devised by D.E. McNamara (USGS-NEIC). This method, very briefly, transforms segments of time-domain data to their frequency-domain equivalents (power spectral densities, or PSD’s), then groups all the PSD’s together to render a PSD Probability Density Function (PSD PDF).
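To make the idea concrete, here is a minimal sketch in Python of how a PSD PDF could be assembled. It is not McNamara's implementation: it leaves out the response removal, segment overlap and period smoothing that a full version would include, and the function names (`segment_psd_db`, `psd_pdf`) and parameters are my own, chosen only for illustration. It cuts a trace into fixed-length segments, computes one PSD per segment, and then counts, at each period, how often each power level occurs.

```python
import numpy as np

def segment_psd_db(x, fs, nfft):
    """One-sided PSD in dB of one segment (Hann-tapered periodogram)."""
    x = np.asarray(x, dtype=float)[:nfft]
    x = x - x.mean()                      # remove the mean before tapering
    w = np.hanning(nfft)
    spec = np.fft.rfft(x * w)
    psd = np.abs(spec) ** 2 / (fs * np.sum(w ** 2))
    # Double the bins that have a negative-frequency twin (not DC, not Nyquist).
    if nfft % 2 == 0:
        psd[1:-1] *= 2.0
    else:
        psd[1:] *= 2.0
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    # Drop the DC bin so that period = 1 / frequency is always defined.
    return freqs[1:], 10.0 * np.log10(psd[1:] + 1e-300)

def psd_pdf(trace, fs, nfft, db_edges):
    """Fraction of segments falling in each dB bin, per period.

    Returns (periods, db_edges, pdf), where pdf has shape
    (number of dB bins, number of periods).
    """
    n_seg = len(trace) // nfft
    psds = []
    for i in range(n_seg):
        freqs, p = segment_psd_db(trace[i * nfft:(i + 1) * nfft], fs, nfft)
        psds.append(p)
    psds = np.array(psds)                 # shape: (n_seg, n_freq)
    pdf = np.empty((len(db_edges) - 1, psds.shape[1]))
    for j in range(psds.shape[1]):
        counts, _ = np.histogram(psds[:, j], bins=db_edges)
        pdf[:, j] = counts / n_seg        # probability of each power level
    return 1.0 / freqs, db_edges, pdf

# Usage with synthetic white noise (not lunar data), sampled at 8 Hz.
rng = np.random.default_rng(0)
demo_trace = rng.standard_normal(64 * 50)
demo_periods, demo_edges, demo_pdf = psd_pdf(
    demo_trace, fs=8.0, nfft=64, db_edges=np.arange(-200.0, 201.0, 1.0)
)
```

Each column of `demo_pdf` is a histogram of power at one period; plotting it as an image with period on the horizontal axis and dB on the vertical axis gives the kind of view discussed in this post.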
The PSD PDF plot for these two years of data, horizontal channel MH1 from station S12:
And we can immediately see something a bit strange. At a period of around two seconds, there is a trough at higher noise levels but a peak at lower noise levels. What’s going on? It turns out that the response provided with the data-set is a nominal response that does not reflect the true response of the instrument itself. Because the response is inaccurate, so too are the resulting PSD’s seen in our PSD PDF.
Here we have an instance of inaccurate meta-data, the response file, where the problem is clearly visible in our PSD PDF. And now we know that if the data around two seconds is to be analyzed or used further, we first need a more accurate response file.
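Why does a wrong response corrupt the PSD? A response-corrected PSD is the raw PSD divided by the squared amplitude of the response, so any mismatch between the nominal and true response shows up directly as a dB offset. The sketch below illustrates only that relationship at a single frequency, assuming a linear instrument. The function name `response_error_db` and the amplitude values are hypothetical, and it makes no attempt to model the actual S12 instrument.

```python
import numpy as np

def response_error_db(h_nominal, h_true):
    """dB offset in a corrected PSD when h_nominal is used instead of h_true.

    estimated PSD = raw PSD / |h_nominal|^2
    true PSD      = raw PSD / |h_true|^2
    so estimated - true, in dB, is 20 * log10(|h_true| / |h_nominal|).
    """
    return 20.0 * np.log10(np.abs(h_true) / np.abs(h_nominal))

# Hypothetical amplitudes: the true instrument gain at this frequency is
# twice what the nominal response claims.
offset_db = response_error_db(h_nominal=1.0, h_true=2.0)
```

With these made-up numbers the corrected PSD would read about 6 dB too high at that frequency; when the nominal and true responses agree, the offset is zero.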
PSD PDF’s are very handy indeed. And with a database filled with these PSD’s, we can group them in other ways to visualize and highlight other types of problems. Stay tuned for more posts on Data QC using this data-set and other views.
– Richard Boaz
(Special thanks to Y. Nakamura for invaluable insight.)