Interview: Forensic Logic CTO, Ronald Mayer
May 20, 2011
Introduction
Ronald Mayer has spent his career with technology start-ups in a number of fields, ranging from medical devices to digital video to law enforcement software. Ron has also been involved in Open Source for decades, with code that has been incorporated into the LAME MP3 library, the PostgreSQL database, and the PostGIS geospatial extension. His most recent speaking engagement was a presentation on a broader aspect of this system to the SD Forum's Emerging Tech SIG, titled "Fighting Crime: Information Choke Points & New Software Solutions." His Lucene Revolution talk is at http://lucenerevolution.org/2011/sessions-day-2#highly-mayer.
Ronald Mayer, Forensic Logic
The Interview
When did you become interested in text and content processing?
I've been involved in crime analysis with Forensic Logic for the past eight years. It quickly became apparent that while a lot of law enforcement information is kept in structured database fields, the richer information is often in text narratives, Word documents on officers' desktops, or internal email lists. Police officers are all too familiar with the long structured search forms, built on top of relational databases, for looking things up in their systems. There are adequate text-search utilities for searching the narratives in their various systems one at a time, and separate text-search utilities for searching their mailing lists. But what they really need is something as simple as Google that works well on all the information they're interested in: structured and unstructured content, their internal documents and data as well as material from other sources. So we set out to build one.
What is it about Lucene/Solr that most interests you, particularly as it relates to some of the unique complexity law enforcement search poses?
The flexibility of Lucene and Solr is what really attracted me. There are many factors that contribute to how relevant a search result is to a law enforcement user. Obviously, traditional text-search factors like keyword density and exact phrase matches matter. How long ago an incident occurred is important (a recent similar crime is more interesting than a long-ago one). Location matters too: most police officers are more interested in crimes that happen in their jurisdiction or neighboring ones, while a state agent focused on alcoholic beverage licenses may want incidents from anywhere in the state but be most interested in ones at or near bars. The quality of the data makes things interesting as well. Victims often have vague descriptions of offenders, and suspects lie. We try to program our system so that a search for "a tall thin teen male" will match an incident mentioning "a 6'3″ 150lb 17-year-old boy." There's been a steady emergence of information technology in law enforcement, such as New York City's CompStat.
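As an illustration only (not Forensic Logic's actual schema or scoring), here is a minimal sketch of how those factors can be combined with Solr's edismax parser: an additive boost that decays with the age of an incident, and a multiplicative boost on distance from a reference point. The field names, reference point, and Solr URL are hypothetical placeholders.

    import requests

    params = {
        "defType": "edismax",
        "q": "tall thin teen male",
        "qf": "narrative^2 description",  # hypothetical text fields
        # additive boost that decays as incident_date ages (roughly halves after a year)
        "bf": "recip(ms(NOW,incident_date),3.16e-11,1,1)",
        # multiplicative boost favoring incidents near the officer's jurisdiction
        "boost": "recip(geodist(),2,200,20)",
        "sfield": "location",        # hypothetical lat/lon field
        "pt": "37.80,-122.27",       # illustrative reference point
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/select", params=params)
    for doc in resp.json()["response"]["docs"]:
        print(doc.get("id"), doc.get("incident_date"))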
What are the major issues in this realm, from an information retrieval processing perspective?
We've had meetings with the NYPD's CompStat group, and they have inspired a number of features in our software, including powering the CompStat reports for some of our customers. One of the biggest issues in law enforcement data today is bringing together data from different sources and making sense of it. These sources could be different systems within a single agency, like records management systems, CAD (Computer Aided Dispatch) systems, and internal agency email lists; groups of cities sharing data with each other; or federal agencies sharing data with state and local agencies.
Is this a matter of finding new information of interest in law enforcement and security? Or is it about integrating the information that’s already there? Put differently, is it about connecting the dots you already have, or finding new dots in new places?
Both. Much of the work we're doing is connecting dots between data from two different agencies, or from two different software systems within a single agency. But we're also indexing a number of non-obvious sources as well. One interesting example is a person who was recently found in our software; one of the better documents describing a gang he's potentially associated with was a Wikipedia page about one of his relatives.
You’ve contributed to Lucene/Solr. How has the community aspect of open source helped you do your job better, and how do you think it has helped other people as well?
It's a bit early to say I've contributed. While I posted my patch to their issue-tracking Web site, last I checked it hadn't been integrated yet. There are a couple of users who mentioned to me on the mailing lists that they are using it and would like to see it merged. The community help has been incredible. One example is when we started a project to make a minimal, simple user interface to let novice users find agency documents. We noticed the University of Virginia/Stanford/etc.'s Project Blacklight, which is a beautiful library search product built on Solr/Lucene. Our needs for one of our products weren't too different: an internal collection of documents with a few additional facets. With that as a starting point we had a working prototype in a few man-days of work, and a product in a few months.
What are some new or different uses you would like to see evolve within search?
It would be interesting if searches could be aware of which adjectives go with which nouns. For example, a phrase like
‘a tall white male with brown hair and blue eyes and
a short asian female with black hair and brown eyes’
should be a very close match to a document that says
‘blue eyed brown haired tall white male; brown eyed
black haired short asian female’
Solr edismax's "pf2" and "pf3" parameters can do quite a good job at this by considering the distance between words, but note that in the latter document the "brown eyes" clause is nearer to the male than the female, so there's some room for improvement. I'd like to see some improved spatial features as well. Right now we use a single location in a document to help sort how relevant it might be to a user (incidents close to a user's agency are often more interesting than ones halfway across the country). But some documents may be highly relevant in multiple locations, like a drug trafficking ring operating between Dallas and Oakland.
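For readers unfamiliar with those parameters, here is a hedged sketch of the pf2/pf3 proximity boosts mentioned above; the field name, weights, and slop value are illustrative, not a production configuration.

    import requests

    params = {
        "defType": "edismax",
        "q": "tall white male brown hair blue eyes",
        "qf": "narrative",
        "pf2": "narrative^5",   # boost docs where adjacent word pairs appear close together
        "pf3": "narrative^3",   # boost docs where word triples appear close together
        "ps": "2",              # phrase slop: tolerate a couple of intervening words
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/select", params=params)
    print(resp.json()["response"]["numFound"], "matching documents")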
When someone asks you why you don’t use a commercial search solution, what do you tell them?
I tell them that, where appropriate, we also use commercial search solutions. For our analysis and reporting product, which works mostly with structured data, we use a commercial text search solution because it integrates well with the relational tables that also filter results for such reporting. The place where Solr/Lucene's flexibility really shines for us is in our product that brings structured, semi-structured, and totally unstructured data together.
What are the benefits to a commercial organization or a government agency when working with your firm? How does an engagement for Forensic Logic move through its life cycle?
Our software is used to power the Law Enforcement Analysis Portal (LEAP) project, a software-as-a-service platform for law enforcement tools, not unlike what Salesforce.com is for sales software. The project started in Texas and has recently expanded to include agencies from other states and the federal government. Rather than engaging us directly, a government agency would engage with the LEAP Advisory Board, which is a group of chiefs of police, sheriffs, and state and federal law enforcement officials. We provide some of the domain-specific software, while other partners such as Sungard manage some of the operations and other software and hardware vendors provide their support. The benefits to government agencies of working with us are similar to the benefits of an enterprise working with Salesforce.com: leading-edge tools without having to buy expensive equipment and software and manage it internally.
One challenge to those involved with squeezing useful elements from large volumes of content is the volume of content and the rate of change in existing content objects. What does your firm provide to customers to help them deal with the volume (scaling) challenge? What is the latency for index updates? Can law enforcement and public security agencies use this technology to deal with updates from high-throughput sources like Twitter? Or is the signal-to-noise ratio too weak to make it worth the effort?
In most cases, when a record is updated in an agency's records management system, the change is pushed to our system within a few minutes. For some agencies, mostly those with older mainframe-based systems, the integration is a nightly batch job. We don't yet handle high-throughput sources like Twitter; license plate readers on freeways are probably the highest-throughput data source we're integrating today. But we strongly believe it is worth the effort to handle high-throughput sources like Twitter, and that it's our software's job to deal with the signal-to-noise challenges you mentioned and present more signal than noise to the end user.
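As a rough sketch of what such a near-real-time push can look like on the Solr side (the record id, field names, and URL below are placeholders, not the actual integration), a commitWithin hint asks Solr to make the update searchable within a bounded time.

    import json
    import requests

    doc = {
        "id": "agency42-incident-001234",    # hypothetical record id
        "narrative": "Updated narrative text from the records management system ...",
        "incident_date": "2011-05-18T22:15:00Z",
    }
    resp = requests.post(
        # commitWithin=60000 asks Solr to make the change visible within 60 seconds
        "http://localhost:8983/solr/update/json?commitWithin=60000",
        data=json.dumps({"add": {"doc": doc}}),
        headers={"Content-Type": "application/json"},
    )
    print(resp.status_code)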
Visualization has been a great addition to briefings. On the other hand, visualization and other graphic eye candy can be a problem for those in stressful operational situations. What's your firm's approach to presenting "outputs" for end-user reuse or for mobile access? Is there native support in Lucid Imagination for results formats?
Visualization is very important to law enforcement, with crime mapping and reporting being very common needs. We have a number of visualization tools, like interactive crime maps, heat maps, charts, timelines, and link diagrams, built into our software, and we also expose XML Web services to let our customers integrate their own visualization tools. Some of our products were designed with mobile access in mind. Others have such complex user interfaces that you really want a keyboard.
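To make that concrete, here is a hedged sketch of the kind of query a chart or heat map might sit on top of: a simple Solr facet query returning incident counts by category. The offense_type field is illustrative, not the actual schema or Web service.

    import requests

    params = {
        "q": "*:*",
        "rows": 0,                    # we only want the counts, not the documents
        "facet": "true",
        "facet.field": "offense_type",
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/select", params=params)
    counts = resp.json()["facet_counts"]["facet_fields"]["offense_type"]
    # Solr returns facets as a flat [value, count, value, count, ...] list
    for value, count in zip(counts[::2], counts[1::2]):
        print(value, count)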
There seems to be a popular perception that the world will be doing computing via iPad devices and mobile phones. My concern is that serious computing infrastructures are needed and that users are "cut off" from access to more robust systems. How do you see the computing world over the next 12 to 18 months?
I think the move to mobile devices is *especially* true in law enforcement. For decades most officers have "searched" their systems by using the radio they carry to verbally ask for information about people and property. It's a natural transition for them to do this on a phone or iPad instead. Similarly, their data entry is often done first on paper in the field and then re-entered into computers. One agency we work with will be getting iPads for each of their officers to replace both of those. We agree that serious computing infrastructures are needed, but our customers don't want to manage those themselves. Better that a SaaS vendor manage a robust system, and what better devices than iPads and phones to access it? That said, for some kinds of analysis a powerful workstation is useful, so good SaaS vendors will provide Web services so customers can pull whatever data they need into their other applications.
Put on your wizard hat. What are the three most significant technologies that you see affecting your search business? How will your company respond?
Entity extraction from text documents is improving all the time, so soon we'll be able to distinguish whether a paragraph mentioning "Tom Green" is talking about a person or the county in Texas. For certain types of data we integrate, XML standards for information sharing such as the National Information Exchange Model are finally gaining momentum; as more software vendors support it, it will become easier to inter-operate with other systems. And rich-media processing, like facial recognition, license plate reading, and OCR, is making new media types searchable and analyzable as well.
I note that you’re speaking at the Lucene Revolution conference. What effect is open source search having in your space? I note that the term ‘open source intelligence’ doesn’t really overlap with ‘open source software’. What do you think the public sector can learn from the world of open source search applications, and vice versa?
Many of the better tools are open source tools. In addition to Lucene/Solr, I'd note that the PostGIS extension to the PostgreSQL database is ahead of the commercial implementations of geospatial tools in some ways. That said, there are excellent commercial tools too; we're not fanatic either way. Open Source Intelligence is important as well, and we're working with universities to bring some of the research they collect on organized crime and gangs into our system. Regarding learning experiences? I think the big lesson is that easy collaboration is a very powerful tool, whether it's sharing source code or sharing documents and data.
Lucene/Solr seems to have matured significantly in recent years, achieving a following large and sophisticated enough to merit a national conference dedicated to the open source projects, Lucene Revolution. What advice do you have for people who are interested in adopting open source search, but don’t know where to begin?
If they're interested, one of the easiest ways to begin is to just try it. On Linux you can probably install it with your OS's standard package manager, with a command like "apt-get install solr-jetty" or similar. If they have a particular need in mind, they might also want to check whether someone has already built a Lucene/Solr-powered application similar to what they need. For example, we wanted a searchable index for a set of publications and documents, and Project Blacklight gave us a huge head start.
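Once a local instance is running, a first query takes only a few lines. This sketch assumes the default example core at http://localhost:8983/solr, which is where a stock install typically listens.

    import requests

    resp = requests.get(
        "http://localhost:8983/solr/select",
        params={"q": "*:*", "rows": 5, "wt": "json"},
    )
    print(resp.json()["response"]["numFound"], "documents in the index")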
David Fishman, May 20, 2011
Post sponsored by Lucid Imagination. Posted by Stephen E Arnold