Summary
WYDOT informed the ODE team that Driver Alert location data was coming out as zeros. The ODE team investigated the data files provided by the WYDOT team and confirmed the inconsistency. Further investigation into the code base revealed that a bug had been introduced when the metadata.serialId field was enhanced to allow more effective auditing and tracking of data in order to detect missing and duplicate records. The defect was introduced in the logic that calculates the metadata and NOT the payload; there was no impact to the data payload at any time.
The ODE team promptly developed a patch to correct the error in the logic and released it to WYDOT for testing. The Wyoming team pulled the patch from GitHub, deployed it to their development environment, and confirmed the fix. Wyoming then deployed the fix to their production environment and resumed the flow of data through the system.
Unfortunately, three weeks later, WYDOT detected an issue with the released patch and informed the ODE team that the bsmSource metadata field was missing from the ODE output stream (https://github.com/usdot-jpo-ode/jpo-ode/issues/316). The ODE team investigated the issue and confirmed the defect. Investigation into the code base revealed that this bug had been introduced when the previous metadata bug fix was implemented in mid-February. The defect was introduced in the same area of logic that calculates the metadata.
On 3/6/2019, in the ODE Sprint Review and Planning meeting, both issues were discussed and the following action items were recorded and assigned.
Details
This section describes the origin of the issues and their resolution.
Origin
An update to the LogFileToAsn1CodecPublisher class was made to support easier troubleshooting of missing or duplicate records and was included in (PR #263, commit). This class takes the raw payload data and wraps it in metadata to create a distribution-ready message. In this class, the metadata object is created once and then updated for each message that passes through. Reusing the metadata object instead of recreating it is done for performance reasons: it reduces memory usage and improves message processing time. The update changed the logic of how this metadata was populated, and a statement for updating the metadata was missed. As a result, the metadata was not updated between messages and was instead reused with the same information, causing duplicate location and timestamp fields to be populated in all metadata generated for records in the same data file.
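The reuse-without-update pattern described above can be illustrated with a minimal sketch. This is Python for brevity, not the actual Java LogFileToAsn1CodecPublisher code, and the field names are illustrative only:

```python
# Illustrative sketch of the defect: a single metadata object is reused
# across messages for performance, so it must be updated for EVERY
# message; skipping the update leaves stale values in later records.

class Metadata:
    def __init__(self):
        self.latitude = None
        self.longitude = None
        self.received_at = None

def wrap_messages(raw_messages, update_metadata=True):
    """Wrap each raw message with metadata, reusing one metadata object."""
    metadata = Metadata()          # created once, for performance
    out = []
    for msg in raw_messages:
        if update_metadata or metadata.received_at is None:
            # this is the update step the bug skipped after the first message
            metadata.latitude = msg["lat"]
            metadata.longitude = msg["lon"]
            metadata.received_at = msg["time"]
        out.append({"payload": msg,
                    "meta": (metadata.latitude,
                             metadata.longitude,
                             metadata.received_at)})
    return out

msgs = [{"lat": 41.1, "lon": -104.8, "time": 1},
        {"lat": 41.2, "lon": -104.9, "time": 2}]

good = wrap_messages(msgs, update_metadata=True)
bad = wrap_messages(msgs, update_metadata=False)   # reproduces the defect
```

With the update skipped, every record after the first carries the first record's location and timestamp, which is exactly the duplicate-metadata symptom WYDOT observed.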
The new issue was a result of the bug fix implemented for the original metadata defect: https://github.com/usdot-jpo-ode/jpo-ode/pull/313 .
Resolution
The bug fix was merged in pull request #313 on GitHub in accordance with published ODE Software Quality Policy. This fixed the bug by correctly adding the code to update the metadata between each message. Locally the developer tested this fix by running the ODE, uploading data files, and visually inspecting the metadata of the output messages. After the developer was confident with local testing that the fix was effective, they pushed their code to GitHub. The ODE developers performed the required peer code review on GitHub where the code changes were inspected for correctness and completeness. Once the code review verified the fix, the code was merged and promoted through the standard release process and WYDOT was notified that the fix was ready for User Acceptance Testing (UAT).
A fix for the second bug has also been identified, and an implementation and deployment plan is in progress.
Impact
The time periods during which data corruption occurred are specified below.
Data Type | Deposited/odeReceivedAt Time Period |
---|---|
BSM with inconsistent metadata | 12/3/2018 - 2/12/2019 |
Received TIMs with inconsistent metadata (NOT Broadcast TIMs) | 12/3/2018 - 2/12/2019 |
Driver Alert with inconsistent metadata | 12/3/2018 - 2/12/2019 |
Distress Notifications with inconsistent metadata | 12/3/2018 - 2/12/2019 |
BSM with missing bsmSource metadata field | 2/13/2019 - |
Data NOT impacted include:
- Broadcast TIMs deposited to WYDOT, DataHub and SDC. Note: DataHub only receives broadcast TIMs from WYDOT
- TIM, Driver Alert and Distress Notifications deposited since 2/13/2019
Resolution and Recovery
1. What data stores require corrective actions to clean data?
The original defective code was deployed on WYDOT's new Production (PROD) server sometime in late November. The PROD server was brought online at the beginning of December (12/3/2018). WYDOT initiated their own clean-up of WYDOT internal data logging on 2/13/2019 and began re-uploading all the data from 12/3/2018 to 2/12/2019. It is envisioned that BSM data will need to be uploaded again to recover from the missing bsmSource issue. To avoid having invalid data in the SDC and DataHub repositories, these repositories will also need to remove the invalid data and await WYDOT's re-upload of good data to replace it. The ODE team is assisting the DataHub team in performing the corrective actions, supported by automated AWS Lambda functions, to ensure ONLY invalid data is removed. In summary:
- Wyoming: Wyoming has removed all of the ODE-generated data from their database for the 12/3/2018 to 2/12/2019 period and has re-uploaded all of the original log files back through ODE and into the data store. However, due to the bsmSource issue, all BSM data needs to be removed and the log files uploaded again.
- DataHub: All affected Wyoming data with metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC should be removed from the S3 bucket. Also, a notification message will be posted on DTG and Sandbox to alert users of the inconsistencies and the time frame of the impending correction.
- SDC: All affected Wyoming data with metadata.odeReceivedAt >= 12/3/2018 AND metadata.odeReceivedAt < dateOfBugFix AND metadata.recordGeneratedBy != TMC should be removed from the data lake / raw submissions bucket and the Data Warehouse. Also, a notification message will be posted on the SDC site to alert users of the inconsistencies and the time frame of the impending correction.
- End Users: A notification message will be posted to inform the users of the inconsistencies and the time frame of the impending correction.
2. What process will be used to identify, isolate, and remove invalid data?
All data uploaded during the bug time periods, with the exception of Broadcast TIMs, will be invalid. To ensure that no valid data is removed from the system, an AWS Lambda function is being created to collect metrics that can be used to verify that invalid data is removed systematically and correctly.
For DataHub, the AWS Lambda function will help identify the count and location of invalid metadata. Analysis of this report, plus any additional checks required by the data store maintainers, will determine the scope of the clean-up. The Lambda function can then be used to automatically delete the invalid data from S3.
For SDC, the ODE team will work with the SDC team to remove invalid data using mutually agreed methods to identify invalid data and generate a report of affected files. Again, analysis of this report, plus any additional checks required by the data store maintainer, will determine the scope of the invalid data and the resulting deletions from the data lake / raw submissions bucket and the Data Warehouse.
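One hedged sketch of how such a function might aggregate per-file counts of invalid records before any deletion takes place. The s3Filename field and the predicate wiring are assumptions for illustration, not the actual Lambda implementation:

```python
from collections import Counter

# Sketch: summarize how many invalid records each S3 file contains, so the
# data store maintainer can review the scope of the clean-up before any
# delete step runs. `is_invalid` is whatever predicate encodes the bug
# criteria; each record is assumed to carry the S3 file it came from.

def invalid_file_report(records, is_invalid):
    """Return {s3 filename: count of invalid records} for affected files only."""
    report = Counter()
    for rec in records:
        if is_invalid(rec):
            report[rec["s3Filename"]] += 1
    return dict(report)

# Hypothetical records, purely for illustration
recs = [
    {"s3Filename": "wydot-bsm-1.json", "bad": True},
    {"s3Filename": "wydot-bsm-1.json", "bad": False},
    {"s3Filename": "wydot-tim-2.json", "bad": True},
]
report = invalid_file_report(recs, lambda r: r["bad"])
```

A report like this supports the "verify before delete" approach described above: the delete step only runs once the maintainers agree the report is complete and correct.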
The following metrics will aid in assuring the integrity of the data deposited to SDC and DataHub during the bug time period.
Pseudo-queries for Identifying Invalid Data
The pseudo-queries in this section will help data store maintainers find and, where necessary, clean up the data. The primary purpose of these queries is to collect metrics for verifying that the clean-up has removed only the invalid data. The key part of these queries is the WHERE clause; the SELECT clause can be modified both to collect metrics and to perform the clean-up in an automated fashion. It is up to the data maintainers to use these pseudo-queries in any way that helps them identify and clean up the repositories.
These sections are intended to be run somewhat sequentially as a checklist or procedure for analysis and cleanup.
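As one hedged illustration, the WHERE clauses used throughout the pseudo-queries below can be expressed as a single record-level predicate. The field names follow the metadata schema used in this document; DATE_OF_BUG_FIX is a stand-in value that each maintainer must replace with the actual fix deployment date:

```python
from datetime import date

# Illustrative predicate for the invalid-data criteria in this document.
# DATE_OF_BUG_FIX below is a PLACEHOLDER, not a date from the source.
BUG_START = date(2018, 12, 3)       # PROD server online
BSM_BUG_START = date(2019, 2, 13)   # bsmSource defect begins
DATE_OF_BUG_FIX = date(2019, 3, 13) # placeholder: replace with real fix date
BSM_PAYLOAD = "us.dot.its.jpo.ode.model.OdeBsmPayload"

def is_invalid(record):
    """True when the record matches the badBsm or badOther criteria."""
    received = record["odeReceivedAt"]
    # All BSMs deposited from 12/3/2018 until the fix are invalid
    bad_bsm = (BUG_START <= received < DATE_OF_BUG_FIX
               and record["payloadType"] == BSM_PAYLOAD)
    # Non-BSM, non-broadcast records deposited 12/3/2018 - 2/12/2019 are invalid
    bad_other = (BUG_START <= received < BSM_BUG_START
                 and record["payloadType"] != BSM_PAYLOAD
                 and record["recordGeneratedBy"] != "TMC")
    return bad_bsm or bad_other
```

Broadcast TIMs (recordGeneratedBy == TMC) never match, which mirrors the document's statement that Broadcast TIMs were unaffected.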
Useful variable declarations
totalRecordCount = SELECT COUNT(*) WHERE
metadata.odeReceivedAt >= 12/3/2018
AND metadata.odeReceivedAt < dateOfBugFix
This variable should contain the total count of records, both valid and invalid, deposited in the repository during the bug time periods.
WYDOT | DataHub | SDC |
---|---|---|
| | |
timBroadcastRecordCount = SELECT COUNT(*) WHERE
metadata.odeReceivedAt >= 12/3/2018
AND metadata.odeReceivedAt < dateOfBugFix
AND metadata.recordGeneratedBy == TMC
This variable should contain the count of TIM Broadcast messages only, deposited during the bug time periods.
WYDOT | DataHub | SDC |
---|---|---|
| | |
goodOtherRecordCount = SELECT COUNT(*) WHERE
metadata.odeReceivedAt >= 2/13/2019
AND metadata.odeReceivedAt < dateOfBugFix
AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload
AND metadata.recordGeneratedBy != TMC
This variable should contain the count of all valid non-BSM, non-broadcast messages deposited during the bug time periods. This count should equal the number of received TIM, Driver Alert and Distress Notification messages deposited on and after 2/13/2019 through dateOfBugFix.
WYDOT | DataHub | SDC |
---|---|---|
| | |
badBsmRecordCount = SELECT COUNT(*) WHERE
metadata.odeReceivedAt >= 12/3/2018
AND metadata.odeReceivedAt < dateOfBugFix
AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload
This variable should contain the count of all invalid BSM records deposited during the bug time periods.
WYDOT | DataHub | SDC |
---|---|---|
| | |
badOtherRecordCount = SELECT COUNT(*) WHERE
metadata.odeReceivedAt >= 12/3/2018
AND metadata.odeReceivedAt <= 2/12/2019
AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload
AND metadata.recordGeneratedBy != TMC
This variable should contain the count of all invalid non-BSM messages deposited during the bug time periods. This count should be equal to the count of invalid received messages for the period 12/3/2018 - 2/12/2019.
WYDOT | DataHub | SDC |
---|---|---|
| | |
removableRecordCount = badBsmRecordCount + badOtherRecordCount
This is the total count of invalid records, which can safely be removed from the repository.
WYDOT | DataHub | SDC |
---|---|---|
| | |
totalRecordCount == (timBroadcastRecordCount + removableRecordCount + goodOtherRecordCount)
This check should validate the analysis of the record counts: totalRecordCount should equal the sum of timBroadcastRecordCount, removableRecordCount and goodOtherRecordCount.
WYDOT | DataHub | SDC |
---|---|---|
| | |
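The count identity above is simple to check mechanically once the four counts have been collected. A sketch, with made-up counts purely for illustration:

```python
# Sanity check from the document: broadcast TIMs, removable (invalid)
# records, and remaining good records should partition the bug-period
# records exactly; any mismatch means the analysis missed some records.

def counts_consistent(total_count, tim_broadcast_count,
                      removable_count, good_other_count):
    """Return True when the partition accounts for every record."""
    return total_count == (tim_broadcast_count
                           + removable_count
                           + good_other_count)

# Hypothetical counts, for illustration only
ok = counts_consistent(1000, 200, 700, 100)       # partition is complete
gap = counts_consistent(1000, 200, 700, 50)       # 50 records unaccounted for
```

Running this check per data store (WYDOT, DataHub, SDC) before any deletion gives a quick go/no-go signal on whether the query criteria are complete.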
earliestGeneratedAt = SELECT MIN(metadata.recordGeneratedAt) WHERE
( metadata.odeReceivedAt >= 12/3/2018
  AND metadata.odeReceivedAt <= 2/12/2019
  AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload
  AND metadata.recordGeneratedBy != TMC )
OR
( metadata.odeReceivedAt >= 12/3/2018
  AND metadata.odeReceivedAt < dateOfBugFix
  AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload )
This query returns the generatedAt timestamp of the first record deposited during the bug time periods. This variable will be used during the validation of the cleanup.
WYDOT | DataHub | SDC |
---|---|---|
| | |
latestGeneratedAt = SELECT MAX(metadata.recordGeneratedAt) WHERE
( metadata.odeReceivedAt >= 12/3/2018
  AND metadata.odeReceivedAt <= 2/12/2019
  AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload
  AND metadata.recordGeneratedBy != TMC )
OR
( metadata.odeReceivedAt >= 12/3/2018
  AND metadata.odeReceivedAt < dateOfBugFix
  AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload )
This query returns the generatedAt timestamp of the last record deposited during the bug time periods. This variable will be used during the validation of the cleanup.
WYDOT | DataHub | SDC |
---|---|---|
| | |
Query the Count of Invalid Data
invalidRecordCount = removableRecordCount
This variable is the same as removableRecordCount and is included here for clarification only.
WYDOT | DataHub | SDC |
---|---|---|
| | |
invalidS3FileCount = SELECT COUNT(s3FileName) WHERE
( metadata.odeReceivedAt >= 12/3/2018
  AND metadata.odeReceivedAt <= 2/12/2019
  AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload
  AND metadata.recordGeneratedBy != TMC )
OR
( metadata.odeReceivedAt >= 12/3/2018
  AND metadata.odeReceivedAt < dateOfBugFix
  AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload )
This variable is the count of invalid S3 bucket files.
WYDOT | DataHub | SDC |
---|---|---|
| | |
Query a List of Invalid Data Files on S3
invalidS3Files = SELECT s3FileName WHERE
( metadata.odeReceivedAt >= 12/3/2018
  AND metadata.odeReceivedAt <= 2/12/2019
  AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload
  AND metadata.recordGeneratedBy != TMC )
OR
( metadata.odeReceivedAt >= 12/3/2018
  AND metadata.odeReceivedAt < dateOfBugFix
  AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload )
This query returns a list of all S3 bucket files that are invalid. This list can be used to verify the clean up. This list can also be saved to a file to be used as input to other scripts.
WYDOT | DataHub | SDC |
---|---|---|
| | |
Query a List of source data files resulting in invalid data
listOfLogFilesBefore = SELECT metadata.logFilename WHERE
( metadata.odeReceivedAt >= 12/3/2018
  AND metadata.odeReceivedAt <= 2/12/2019
  AND metadata.payloadType != us.dot.its.jpo.ode.model.OdeBsmPayload
  AND metadata.recordGeneratedBy != TMC )
OR
( metadata.odeReceivedAt >= 12/3/2018
  AND metadata.odeReceivedAt < dateOfBugFix
  AND metadata.payloadType == us.dot.its.jpo.ode.model.OdeBsmPayload )
This query returns a list of all input source files that resulted in invalid S3 files. This list can be used to verify the clean up. This list can also be saved to a file to be used as input to other scripts.
WYDOT | DataHub | SDC |
---|---|---|
| | |
3. What process will be used for verification of corrective actions to address data store (WYDOT, SDC and DataHub) clean-up?
Manual:
- A sampling of messages will be manually inspected to check that all fields in the metadata are correctly populated with data that makes sense
Automated:
- At initial implementation, an AWS Lambda function will iterate through all messages with data from the bug time period to perform several context checks (further context checks will be added as determined valuable by the team and approved by the Product Owners), including but not limited to the following:
- Timestamps are not duplicated
- Location data are not duplicated
- Serial numbers are sequential and not duplicated
- Key required fields are not missing
- The AWS Lambda function is being developed and tested. In its prototype state, it has been used to identify potentially invalid data in the month of December. This function will be used to analyze all of the data in the affected time period and identify invalid data.
- Removal of the data will not be performed until a later date when the report of corrupted data files is determined to be complete and correct
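A rough sketch of what such context checks might look like, reduced to pure functions so they can run and be tested outside AWS. The field names are assumptions for illustration, not the Lambda's actual schema:

```python
# Hedged sketches of the context checks listed above: duplicate
# timestamps/locations, non-sequential serial numbers, and missing
# required fields. Field names are illustrative assumptions.

def duplicate_values(records, field):
    """Return the set of values of `field` that appear in more than one record."""
    seen, dups = set(), set()
    for r in records:
        v = r[field]
        (dups if v in seen else seen).add(v)
    return dups

def serials_sequential(records):
    """Serial numbers must increase by exactly 1, with no repeats or gaps."""
    serials = [r["serialNumber"] for r in records]
    return all(b == a + 1 for a, b in zip(serials, serials[1:]))

def missing_required(records, required=("latitude", "longitude", "speed")):
    """Return indices of records with a missing or null required field."""
    return [i for i, r in enumerate(records)
            if any(r.get(f) is None for f in required)]

# Hypothetical records, for illustration only
recs = [
    {"serialNumber": 1, "ts": 10, "latitude": 41.0, "longitude": -104.0, "speed": 5},
    {"serialNumber": 2, "ts": 10, "latitude": None, "longitude": -104.0, "speed": 5},
]
```

In a real deployment these checks would run over a batch of messages per S3 file and feed the alerting report; here they simply return the offending values or record indices.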
Pseudo-queries for Validating the Clean Up (after the completion of the clean-up)
Query the Count of Valid Data
validRecordCount = SELECT COUNT(*) WHERE
metadata.odeReceivedAt >= 2/13/2019
AND metadata.odeReceivedAt < dateOfBugFix
AND metadata.recordGeneratedBy != TMC
AND metadata.recordGeneratedAt >= earliestGeneratedAt
AND metadata.recordGeneratedAt <= latestGeneratedAt
[ AND metadata.serialId.bundleSize > 1 ]
This variable is the count of valid records in the bug time periods.
WYDOT | DataHub | SDC |
---|---|---|
| | |
validS3FileCount = SELECT COUNT(s3FileName) WHERE
metadata.odeReceivedAt >= 2/13/2019
AND metadata.odeReceivedAt < dateOfBugFix
AND metadata.recordGeneratedBy != TMC
AND metadata.recordGeneratedAt >= earliestGeneratedAt
AND metadata.recordGeneratedAt <= latestGeneratedAt
[ AND metadata.serialId.bundleSize > 1 ]
This variable is the count of valid S3 files deposited in the bug time periods.
WYDOT | DataHub | SDC |
---|---|---|
| | |
Note: Because the bug was introduced with the serialId changes, most if not all invalid metadata records will have a bundleSize greater than one. Comparing the counts with and without this filter parameter will aid in identifying invalid data.
[bracket] indicates optional parameter
Query a List of Valid Data Files on S3
validS3Files = SELECT s3FileName WHERE
metadata.odeReceivedAt >= 2/13/2019
AND metadata.odeReceivedAt < dateOfBugFix
AND metadata.recordGeneratedBy != TMC
[ AND metadata.serialId.bundleSize > 1 ]
This query returns a list of valid S3 files from the bug time periods.
WYDOT | DataHub | SDC |
---|---|---|
| | |
[bracket] indicates optional parameter
Query a List of source data files
listOfLogFilesAfter = SELECT metadata.logFilename WHERE
metadata.odeReceivedAt >= 2/13/2019
AND metadata.odeReceivedAt < dateOfBugFix
AND metadata.recordGeneratedBy != TMC
[ AND metadata.serialId.bundleSize > 1 ]
This query returns the list of source log files uploaded to ODE during the bug time period, and it should be the same before and after the clean up.
WYDOT | DataHub | SDC |
---|---|---|
| | |
Metadata Bug Analysis and Recovery Reports
WYDOT | DataHub | SDC |
---|---|---|
| DataHub Metadata Bug Analysis and Recovery Report | |
Validation Checklist
WYDOT | DataHub | SDC |
---|---|---|
| | |
4. What are target timelines for additional Quality Assurance, Bug Fix, UAT testing of the bug fix, identifying, correcting and validating the data?
Task | Description | Owner | Target Completion Date | Actual Completion Date |
---|---|---|---|---|
1 | ODE will incorporate a QA Checklist and Script into their QA process. | BAH | 3/19/2019 (End of ODE Sprint 43) | 3/19/19 |
2 | ODE will fix the missing bsmSource defect and run the QA Validation checklist and scripts to verify the fix and regression test. This fix will be on top of the new SDW feature implementation. | BAH | 3/19/2019 | 3/19/19 |
3 | ODE will deploy an AWS Lambda function to check for data inconsistencies as they arrive on DataHub. Such inconsistencies may include serialId serial numbers grossly out of order or repeating, time stamps repeating, required fields missing or null, etc. These Lambda functions will be shared with the community on GitHub so the SDC and WYDOT teams will also be able to utilize them for data validation. (ODE-1230) | BAH | 4/3/2019 | 4/4/19 |
4 | ODE will release the validated ODE software to start UAT testing by WYDOT. | BAH | 3/19/2019 | 3/19/19 |
5 | WYDOT will deploy jpo-ode-1.0.5 on their DEV server and perform UAT testing. The checklist and validation scripts will be available to them for test and validation. The SDC team will test in their test environment. | WYDOT | 3/26/2019 | 3/25/19 |
6 | This item is not related to the metadata bug, but for the sake of coordination this task was inserted to assess whether it would be beneficial and more efficient to perform folder restructuring at the same time as data clean-up and before data re-upload. It was ultimately decided and approved by the PO that folder restructuring adds unnecessary complexity to the clean-up process and schedule and is best deferred until after the data has been completely restored. See ODE-1229 and RDA-674 for details. Excerpt from approval email: | BAH | 3/27/2019 | |
7 | | ALL | week of 3/20/2019 | 3/25/19 |
8 | ODE, DataHub and SDC teams will verify that freshly arrived BSM, TIM and Driver Alert messages are correct and consistent. The validation checkers deployed to both systems should also verify data validity in DEV. COORDINATION MEETING TO CONFIRM ALL TEAMS ARE RECEIVING CLEAN DATA FROM WYDOT; GO/NO-GO FOR WYDOT PROD DEPLOYMENT | | 4/11/2019 | 4/11/2019 |
9 | UAT testing completed; Wyoming will promote v1.0.7 to the PROD environment AFTER the coordination meeting with stakeholders (ref: ODE-1211 → ODE-1200, ODE-1202) | WYDOT | 4/11/2019 | 4/11/2019 |
10 | ODE collects counts as specified in the Useful variable declarations section. This exercise will identify the earliest and latest generatedAt times for bad data. These counts can be used in cleanup verification. (ODE-1200, ODE-1202) | | 4/17/2019 | |
11 | Run the queries mentioned in the Query the Count of Invalid Data section above. Note: the BSM and TIM data types may require separate cleanup processes, as affected BSMs can likely be identified and removed without additional analysis, while TIM messages will likely need additional effort to isolate due to the inability to re-upload unaffected Broadcast TIMs. Since the DataHub repository only contains Broadcast TIMs, which were not affected by the bug, no action is required for DataHub TIM cleanup; only invalid BSMs received from 12/3/2018 - dateOfBugFix will need to be removed. SDC will need to remove all received TIMs for the received period of 12/3/2018 - 3/12/2019. WYDOT has already removed all invalid TIMs and the original invalid BSMs; only BSMs received from 2/13/2019 - dateOfBugFix must be removed and uploaded again. WYDOT, SDC and DataHub will have all BSM records refreshed after the bug fix is deployed to the WYDOT PROD server. WYDOT should review whether their deduplication software will remove or retain the erroneous data. (ODE-1200, ODE-1202) | | 4/19/2019 | |
12 | Run the query mentioned in the Query a List of Invalid Data Files on S3 section above. (ODE-1200, ODE-1202) | | 4/19/2019 | |
13 | Run the query mentioned in the Query a List of source data files resulting in invalid data section above. (ODE-1200, ODE-1202) | | 4/19/2019 | |
14 | The results of the queries from tasks 9-12 should be aggregated into a report for summary and understanding, to be presented to the Product Owner. (ODE-1200, ODE-1202) | | 4/19/2019 | |
15 | COORDINATION MEETING TO CONFIRM ALL TEAMS' DATA REMOVAL PLANS WITH ALL STAKEHOLDERS (ODE-1212); GO/NO-GO ON DATA REMOVAL | | Recovered schedule and conducted meeting on 4/19/2019 | 4/19/2019 |
16 | Once confidence in the summary findings is gained, the Lambda function used to run the queries in tasks 9-12 should be modified to delete the S3 files found by those queries. Running this function will be the cleanup step. (ODE-1212 → ODE-1203, ODE-1204) | | 4/19/2019 | |
17 | The queries mentioned in the Pseudo-queries for Validating the Clean Up section above should be run, ensuring that the expected output matches the actual output. (ODE-1203, ODE-1204) | | | |
18 | The Validation Checklist above should be iterated to ensure that the cleanup actions deleted the invalid data and did not affect the valid data. In addition to the automatic validation steps using the queries, manual inspection of the bucket should be performed as a sanity check. (ODE-1203, ODE-1204) | | | |
19 | COORDINATION MEETING TO CONFIRM GO/NO-GO ON WYDOT RE-UPLOAD (ODE-1269). DataHub and SDC expect the following to be re-uploaded by WYDOT: | | 4/26/2019 | |
20 | The WYDOT team starts re-uploading all data files that were identified during the analysis phase as invalid and deposits them to the respective data stores. Due to inconsistencies between data stored in the WYDOT database and SDC, an investigation was initiated with the following results: | WYDOT | 4/30/2019 | 4/30/2019 |
21 | | | 5/14/2019 | 5/14/2019 |
22 | | | 5/22/2019 | |
23 | On 5/23/2019, Brandon will re-upload all data from 12/3/2018 00:00:00 UTC through 5/15/2019 00:00:00 UTC. | WYDOT | 5/23/2019 | |
24 | DataHub will post a Release Note indicating that there are and will be duplicate records in the data until further notice. When that is depends on Lear's fixing of the firmware. (Lien, Julia [USA], Michael Middleton) | BAH | 5/23/2019 | |
25 | Hamid Musavi will update the metadata bug Confluence page with current status, understanding and action plan. Meanwhile, copying @Ariel.Gold@dot.gov so she is informed of where we are in the re-upload process. | BAH | 5/17/2019 | |
26 | | | 6/17/2019 | 5/3/2019 held meeting, but "NO-GO"; data re-upload issue resolution ongoing with daily meet-up |
27 | Communicate to all data users of WYDOT and SDC, and document on DataHub, that the cleanup is complete | BAH | 5/31/2019 | |
Prevention
1 - What preventative measures will be put in place to better detect future potential “bad” data defect events and notify ODE stakeholders of potentially invalid data?
Immediate actions:
- The ODE unit test suite will be enhanced to detect defects of this nature at development time (ODE-1158)
- An AWS Lambda function will be deployed on DataHub, and shared with the SDC team to be deployed there, to capture and automatically alert Operations teams and the Data Provider when certain key contextual test rules are violated; for example, out-of-sequence or duplicate serialId or timestamp values, or zero or null values in required fields such as lat/long/speed/heading if persistent over a period of time (ODE-1199)
Long-term actions pending Product Owner approval:
- Automated integration testing will be introduced to the ODE Software Quality Policy to provide end-to-end test coverage through test harness implementation, significantly reducing reliance on visual inspection during local development and prior to PROD deployments
- A Data Validation feature will be added to ODE to flag any potential data elements that are outside a preconfigured validity range
- A contextual data quality checker module will be added to the ODE backlog and presented to the Product Owner for approval and prioritization.
Note: Currently, payload data (ODE inputs) are validated against the J2735 standard with essentially 100% unit test coverage; the only current gap for payload testing is in validation of incoming JSON during encoding of TIMs. Risk in this area is low given that the encoder module will fail to encode invalid TIM input potentially generated by a bug or error in the incoming JSON.
2 - What process will be used to investigate potential “bad” data detection alerts to initiate a Data Incident Response?
The proposed AWS Lambda corrupt-data "canary" will operate in both the test and Prod DataHub S3 storage buckets, running every day to check the data for contextual validity. Upon a failure of this check, the Lambda function will automatically generate a report of the affected data and the checks that failed, and will email that report to the relevant parties.
Upon receipt of this email alert, the appropriate parties will begin using both manual and automated inspection tools to inspect the alerted data and confirm the corruption. The investigators will work to isolate the affected data and the cause of the corruption, and then begin the remediation process. Part of this process will include creating a post-mortem report.
3 - What is the Data Incident Response protocol for data set maintainers, including initial notification and communications?
a. Impact analysis / triage?
Data incident responders will begin identifying the scope of the corruption and the cause.
b. Defect resolution?
Data incident responders will alert the data store maintainers to coordinate suspension of data depositing and inform all potentially impacted systems and projects. Corrupt data MAY then be removed. When a bug fix is ready, the fix will be deployed and the invalid data shall be re-uploaded.
c. Data Set impact analysis?
Responders will use both manual inspection and automated tools to identify the full impact of the bug.
d. Data Set corrective action planning?
Once the cause is identified, a post-mortem report will be created. The post-mortem document will be used to discuss how the data corruption could be caught, mitigated, and prevented in the future.
e. Analysis of additional preventative measures?
Upon identification of the cause, preventative measures will be contemplated and enacted.
Action & Decision Items from Sprint Review/ Planning 3/6/19
Item | Title | Description | Owner | Target Date |
---|---|---|---|---|
1 | Action: Detail separate approaches for TIMs cleanup and BSM cleanup to have the most expedient and low-risk approach for each data type | Since Broadcast TIMs were not affected by the metadata bugs and are impossible for WYDOT to re-upload, TIMs cleanup will require additional checks for removing only the affected Received TIMs in impacted data stores, while BSMs can likely be removed and re-uploaded in the impacted data stores in a simpler process with fewer steps. Update the "2. What process will be used to identify, isolate, and remove invalid data?" section above to detail the approach for TIM and BSM data types separately, with concurrence from WYDOT, DataHub, and SDC, to ensure the most expedient and low-risk approaches and processes are optimized for clean-up of each data type. Cleanup processes should be the result of collaboration between WYDOT, ODE, DataHub, and SDC and employ one set of shared validation tools for data store cleanup. The updated document above needs to be sent to Ariel before 4:30 pm eastern time on 3/8/19. | | 3/8/19 |
2 | Action: Consolidate full analysis, impact, recovery and prevention for both Metadata bug events in one document (above) to support a single cleanup process | Since both Metadata bug events are related, the document above will be updated with additional content related to the bsmSource bug. | | 3/7/19 |
3 | Decision: Do not start the data store cleanup until the validated fix for the second bug (bsmSource issue) is deployed and complete. | To lower the risk of any additional Metadata bug events and avoid having to complete two separate cleanup cycles on the impacted data stores, ODE will implement the approved Quality Assurance measures detailed in ODE-1206, ODE-1207, and ODE-1208 before implementing the fix for the bsmSource bug in ODE-1209. After ODE-1209 is completed, the data store cleanup process will be completed. | | 3/6/19 |
4 | Action: Verify that Socrata may not require ANY cleanup actions based on how many records it holds and may "clean itself" in a reasonable timeframe | Due to the number of records that Socrata holds (3 million) and a data rate from WYDOT of approximately 2 million records per day, it is likely that old data will be removed automatically and no records impacted by the Metadata bugs will be in the system after approximately 36 hours. Need to verify that no cleanup actions are required for Socrata. | | 3/8/19 |
5 | Action: Immediate notification with timeline communicated to end users | Ensure that end users of the impacted data stores are notified with a reasonable timeline for when data cleanup will be completed and the impact to them. This notification needs to be sent before 4:30 pm eastern time on 3/8/19. See notification statement below. | | 3/8/19 |
6 | Action: Validation feature to be added to data stores' (SDC, WYDOT, DataHub) backlogs | Coordinate between teams (perhaps through scrum of scrums) and get the ingest validation feature added to their respective backlogs, utilizing the output validation scripts that the ODE team will develop. The scripts should be reusable for each team's data ingest validation process (e.g., for SDC, WYDOT, DataHub) to ensure that they validate all data coming in. | | 3/8/19 |
Notification
Data Defect Notification (3/8/2019)
Statement for end users of data stores, to be posted to the relevant end user area for WYDOT, DataHub, DTG, and SDC no later than 4:30 pm Eastern time on 3/8/19:
We would like to bring to your attention some errors in the Wyoming metadata fields described below:
Data Type | Inconsistency Observed | Deposit Time Frame |
---|---|---|
| Erroneous metadata observed in the following fields: | 12/3/2018 - 2/12/2019 |
| Missing metadata observed in the following field: | 2/13/2019 - |
All payload data included in BSM and TIM messages, as well as the entire metadata for TIM Broadcast messages (those with metadata.recordGeneratedBy = TMC), are unaffected and entirely correct. We expect these metadata errors to be fully corrected in a few weeks and all invalid metadata to be removed shortly after.
We appreciate your patience as we work through these metadata errors. If you have any questions about these issues, please feel free to contact the Dataset Owner through the normal support channels.
For DataHub, we recommend the following short statement be added to the Socrata description, which will appear on DTG:
“We would like to bring to your attention some errors in the metadata fields for the Wyoming dataset deposited from 12/3/2018 – present, which we are working to fix. Please see full details in the ITS DataHub data sandbox.”
The ITS DataHub data sandbox will contain the full message as shown above, with the portions referring to TIM removed, as DataHub TIM data is unaffected.
For SDC and Wyoming, an email will be sent to data users to inform them of the metadata errors.
Data Cleanup Complete Notification (4/25/2019)
Below is a proposed communication tailored specifically for DataHub. The communications going out to SDC and WYDOT will be similar. The highlighted wording signifies customization for each specific user group, i.e., SDC and WYDOT. Please review the language and reply to all with your feedback.
This is an update regarding the errors in the Wyoming metadata that were communicated to the user community on 3/8/2019. The bug manifested itself in incorrect metadata fields in the Wyoming Basic Safety Message (BSM) and Traveler Information Message (TIM) data. All payload data included in BSM and TIM are unaffected and entirely correct.
The ODE software version that contributed to the metadata field errors has been corrected and deployed to the Wyoming production server on 4/11/2019. No invalid data has been deposited to DataHub since 4/12/2019. All existing invalid data has also been removed. The historical data will be restored in the upcoming days and weeks. A final communication will be made once all data restoration has been completed.
We appreciate your patience as we work through resolving these metadata errors. If you have any questions about these issues, please feel free to contact RDAE_Support@bah.com.
Final Notification (7/2019)
Below is a proposed communication tailored specifically for DataHub. The communications going out to SDC and WYDOT will be similar. The highlighted wording signifies customization for each specific user group, i.e., SDC and WYDOT. Please review the language and reply to all with your feedback.
This is an update regarding the errors in the Wyoming metadata that were communicated to the user community on 3/8/2019. The bug manifested itself in incorrect metadata fields in the Wyoming Basic Safety Message (BSM) and Traveler Information Message (TIM) data. All payload data included in BSM and TIM are unaffected and entirely correct.
As noted previously, the ODE software version that contributed to the metadata field errors has been corrected and deployed to the Wyoming production server on 4/11/2019. No invalid data has been deposited to DataHub since 4/12/2019. All existing invalid data have been removed.
Restoration of the historical data was confirmed as of 7/12/2019. Furthermore, in the coming weeks we will be reorganizing the Wyoming data folders in the ITS Sandbox to be based on the generatedAt timestamp instead of the receivedAt timestamp, to make data retrieval easier.
We appreciate your patience as we work through resolving these metadata errors. If you have any questions about these issues or have concerns about the reorganization of ITS Sandbox data, please contact RDAE_Support@bah.com.