Splunk’s go-to add-on for querying SQL databases, DB Connect, has an interesting glitch. Now and then, often after a restart, one or more inputs get disabled. Data flow stops, dashboards fall behind, and inevitably the client takes notice. After dealing with this, we at Threat Informant wanted the system, rather than the client, to be the one alerting us.
The method we settled on was simple in concept, but took some creative flair to implement. The idea was to aggregate run counts for all inputs, grouped by connection, and compare them to a calculated ideal per 24-hour period. Intervals specified in seconds were of course easy to process, but imagine the complexity of converting cron schedules! As is often the case, we found initial success after writing an intensely long SPL string.
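To give a flavor of the "calculated ideal", here is a minimal Python sketch (not the SPL we wrote) that turns a schedule into an expected number of runs per 24 hours. The function names are hypothetical, and it assumes a schedule is either a plain number of seconds or a five-field cron expression that fires every day; anything with day-of-month, month, or day-of-week restrictions is rejected rather than guessed at.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def expand_field(field, low, high):
    """Expand one cron field such as '*/15', '8-17' or '0,30' into a set of ints."""
    values = set()
    for part in field.split(","):
        step = 1
        has_step = "/" in part
        if has_step:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = int(part)
            # 'N/step' runs from N to the top of the range; a bare 'N' is one value.
            end = high if has_step else start
        values.update(range(start, end + 1, step))
    return values


def expected_runs_per_day(schedule):
    """Return how many times a schedule should fire in a 24-hour period."""
    schedule = str(schedule).strip()
    if schedule.isdigit():
        interval = int(schedule)
        if interval <= 0:
            raise ValueError("interval must be a positive number of seconds")
        return SECONDS_PER_DAY // interval
    fields = schedule.split()
    if len(fields) != 5:
        raise ValueError("expected seconds or a five-field cron expression")
    minute, hour, day_of_month, month, day_of_week = fields
    # Assumption: only schedules that run every day are handled in this sketch.
    if (day_of_month, month, day_of_week) != ("*", "*", "*"):
        raise ValueError("only daily cron schedules are supported here")
    return len(expand_field(minute, 0, 59)) * len(expand_field(hour, 0, 23))
```

For example, `expected_runs_per_day("300")` and `expected_runs_per_day("*/5 * * * *")` both come out to 288, so an input on either schedule that logged far fewer runs in the last day is a candidate for an alert.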
But our success proved dissatisfying. Sure, we’d get an important alert, but it led to a violation of one of Threat Informant’s Core Values: “Don’t reiterate what you can automate.” Why were we the ones restarting these inputs? If we could teach Splunk when it happened, then surely we could teach it how to react! And so, a new project began: Scripted Restart of DB Connect Inputs. And this is where it got fun.
We started off with terrible news. Splunk’s REST API boasts a series of GET endpoints for Modular Inputs, but nothing in the POST area. At a glance, the most it could do was make our existing alert more efficient. Not good enough! So we decided to go directly to Splunk, and to our delight, there’s an undocumented modular input endpoint for this add-on:
Now, I hope you realize how awesome this is: we can regularly monitor the status of every input and, in the event of an unintended outage, make three attempts to restart it without lifting a finger! (After three strikes, it’s time for the script to ask for human help.) On top of that, we built a dashboard that displays the current status of every input and makes input details available to the end user on request. (Ever had someone want to confirm the SQL you used to make the input?) Since this innovation, we’ve never had an input stay down for more than five minutes. The real beauty, though, is how this segues into full Splunk auto-administration, but that is another blog entry…
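For the curious, the three-strikes logic described above can be sketched in a few lines of Python. This is an illustration under assumptions, not our production script: `is_enabled`, `enable`, and `ask_for_help` are hypothetical callables standing in for whatever checks the input's status, re-enables it, and notifies a human, and no real Splunk endpoint is called here.

```python
import time


def restart_with_retries(is_enabled, enable, ask_for_help,
                         max_attempts=3, wait_seconds=0.0, sleep=time.sleep):
    """Try to bring a disabled input back up, escalating after max_attempts.

    is_enabled, enable and ask_for_help are hypothetical callables supplied
    by the caller; this function only handles the retry bookkeeping.
    Returns "healthy", "restarted" or "escalated".
    """
    if is_enabled():
        return "healthy"
    for _attempt in range(max_attempts):
        enable()
        # Give the input a moment to come back before checking again.
        sleep(wait_seconds)
        if is_enabled():
            return "restarted"
    # Three strikes: hand the problem over to a human.
    ask_for_help()
    return "escalated"
```

Passing the status check and the restart action in as callables keeps the retry policy separate from how the input is actually queried or toggled, so the same loop can be exercised with stubs before it is pointed at a live system.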