Exclusive Interview with Margie Hlava, Access Innovations

July 19, 2011

Access Innovations has been a leader in the indexing, thesaurus, and value-added content processing space for more than 30 years. Margie Hlava's company has worked for most of the major commercial database publishers, the US government, and a number of professional societies.


See www.accessinn.com for more information about MAI and the firm’s other products and services.

When I worked at the database unit of the Courier-Journal & Louisville Times, we relied on Access Innovations for a number of services, including thesaurus guidance. Her firm's MAI system and its supporting products deliver what most of the newly minted “discovery” systems need: indexing that is accurate, consistent, and makes it easy for a user to find the information needed to answer a research or consumer-level question. What few realize is that the value of the systems and methods developed by the taxonomy experts at Access Innovations rests on standards. Specifically, the Access Innovations approach generates an ANSI-standard term list. Without getting bogged down in details, the notion of an ANSI-compliant controlled term list embodies logical consistency and adherence to strict technical requirements. See the ANSI/NISO Z39.19 standard. Most of the 20-somethings hacking away at indexing fall far short of the quality of the Access Innovations implementations. Does their work count as quality? Not in my book. Give me the Access Innovations (Data Harmony) approach.

Care to argue? I think you need to read the full interview with Margie Hlava in the ArnoldIT.com Search Wizards Speak series. Then we can interact enthusiastically.

On a rare visit to Louisville, Kentucky, on July 15, 2011, I was able to talk with Ms. Hlava about the explosion of interest in high-quality content tagging, the New Age word for indexing. Our conversation ranged from the roots of indexing to the future of systems which will be available from Access Innovations in the next few months.

Let me highlight three points from our conversation, interview, and enthusiastic discussion. (How often do I, in rural Kentucky, get to interact with one of the leading figures, if not the leading figure, in taxonomy development and smart, automated indexing? Answer: Not often enough.)

First, I asked how her firm fits into the landscape of search and retrieval.

She said:

I have always been fascinated with logic, and applying it to search algorithms was a perfect match for my intellectual interests. When people have an information need, I believe there are three levels to the resources which will satisfy them. First, the person may just need a fact checked. For this they can use an encyclopedia, a dictionary, and so on. Second, the person needs what I call “discovery.” There is no simple factual answer, and one needs to be created or inferred. This often leads to a research project, and it is certainly the beginning point for research. Third, the person needs updating: what has happened since I last gathered all the information available? Ninety-five percent of search is either number one or number two. These three levels are critical to answering users' questions properly and determining what kind of search will support their needs. Our focus is to change search to found.

Second, I probed why indexing is such a hot topic.

She said:

Indexing, which I define as the tagging of records with controlled vocabularies, is not new. Indexing has been around since before Cutter and Dewey. My hunch is that librarians in Ephesus put tags on scrolls thousands of years ago. What is different is that it is now widely recognized that search is better with the addition of controlled vocabularies. The use of classification systems, subject headings, thesauri and authority files certainly has been around for a long time. When we were just searching the abstract or a summary, the need was not as great because those content objects are often tightly written. The hard sciences went online first and STM [scientific, technical, medical] content is more likely to use the same terms worldwide for the same things. The coming online of social sciences, business information, popular literature and especially full text has made search overwhelming, inaccurate, and frustrating. I know that you have reported that more than half the users of an enterprise search system are dissatisfied with that system. I hear complaints about people struggling with Bing and Google.
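
To make her definition concrete, here is a toy sketch, in Python, of tagging a record against a controlled vocabulary. It is emphatically not Access Innovations' MAI; the terms, the synonym lists, and the matching rule are invented for illustration, and a production system handles word forms, phrase boundaries, and editorial rules far more carefully.

    # Toy illustration only: rule-based tagging against a made-up controlled vocabulary.
    # This is not Access Innovations' MAI; terms and matching logic are invented.
    CONTROLLED_VOCABULARY = {
        "Thesauri": ["thesaurus", "thesauri", "synonym ring"],
        "Taxonomies": ["taxonomy", "taxonomies", "controlled vocabulary"],
        "Indexing": ["indexing", "tagging", "subject heading"],
    }

    def tag_record(text):
        """Return preferred terms whose synonyms appear in the record's text."""
        lowered = text.lower()
        return [preferred
                for preferred, synonyms in CONTROLLED_VOCABULARY.items()
                if any(s in lowered for s in synonyms)]

    print(tag_record("The thesaurus drives consistent tagging of records."))
    # Prints: ['Thesauri', 'Indexing']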

Third, I queried her about her firm’s approach, which I know to be anchored in personal service and obsessive attention to detail to ensure the client’s system delivers exactly what the client wants and needs.

She said:

The data processed by our systems are flexible and free to move. The data are portable. The format is flexible. The interfaces are tailored to the content via the DTD for the client's data. We do not need to do special programming. Our clients can use our system and perform virtually all of the metadata tasks themselves through our systems' administrative module. The user interface is intuitive. Of course, we would do the work for a client as well. We developed the software for our own needs, and that includes needing to be up and running and in production on a new project very quickly. Access Innovations does not get paid for down time. So our staff are trained. The application can be set up, fine-tuned, and deployed in production mode in two weeks or less. Some installations can take a bit longer. But as soon as we have a DTD, we can have the XML application up in two hours. We can create a taxonomy really quickly as well. So the benefits are: fast, flexible, accurate, high quality, and fun!

You will want to read the complete interview with Ms. Hlava. Skip the pretend experts in indexing and taxonomy. The interview answers the question, “Where’s the beef in the taxonomy burger?”

Answer: http://www.arnoldit.com/search-wizards-speak/access-innovations.html

Stephen E Arnold, July 19, 2011

It pains me to say it, but this is a freebie.

India Based Call Centers: The Worker Perspective

July 16, 2011

ReadWrite Enterprise asks, “What’s It Like to Work in an Indian Call Center?” For an answer, writer Klint Finley turns to a Mother Jones article by an American writer, Andrew Marantz, who sampled Indian call center training.

With regard to overseas call centers, ReadWrite generally focuses on whether businesses should invest in them now or wait until working conditions and customer service quality improve. That question remains unanswered here, but this piece does call attention to the workers’ perspective.

Finley summarizes:

“Much of the article revolves around the cultural impact that the business process outsourcing industry has brought to India, both good (more economic opportunities for women) and bad (the potential stifling of Indian culture as call center workers attempt to conceal their identities).”

Among the points Finley highlights:

  • There are almost as many women as men working in the call centers.
  • Many of these workers are college educated, but are doing very basic work.
  • Some workers are encouraged to eat American fast food and listen to American music, even on the weekends.

Treatment of Indian workers by their employers eager to westernize them is a tangled thicket.

How does this relate to search? Three points of contact.

First, search vendors talk about improving customer service. What seems to be more accurate is reducing the costs of support for the company offering help to its customers.

Second, it is not clear in my mind that brute force indexing or even more sophisticated systems do much to address the disconnect between what a customer needs and what is available to answer the question. If the answer is not in the processed content, who is kidding whom?

Third, the notion of improving a customer interface sounds great in a meeting. But the actual implementation is usually more about preventing the customer from contacting the company, even as the company describes itself as “customer facing” and “committed to excellence in customer support.”

Enterprise search vendors know how to address these issues. Right?

Cynthia Murrell, July 16, 2011

Cheerleading for Google+ Will Not Drown Out Foundem

July 12, 2011

The noise about the two-week-old Google baby, Google+, is loud; heavy metal loud. But I am not sure Google+ will drown out the peeps from the Foundem matter. Foundem is a shop-and-compare service in the UK. It’s pretty good, but the firm alleges that Google’s method of indexing did some harm to Foundem’s traffic.

“Foundem Takes on Google’s Search Methods,” relates the San Francisco Chronicle’s SFGate. Here we go again: more legal pressure over Google’s search practices.

The small British shopping comparison site Foundem was one complainant who prompted last year’s European Commission antitrust probe against Google. The site’s representatives have also spoken to U.S. antitrust agencies, including the Federal Trade Commission. That agency currently has Google under investigation.

Foundem’s complaints stem from the period between 2006 and 2009, when its Google rankings were so low as to be nearly absent. The article elaborates on their perspective:

Specifically, Foundem says that Google tweaked its search algorithm to give a lower ranking to sites that had little original content and were mostly designed to send users to other places on the Web. Such a change sounds like a reasonable effort to filter out low-quality sites, but it had the effect of eliminating from results those vertical-search engines that in one manner or another compete with Google, the company maintains. After all, sites without much original content designed to send users elsewhere is basically the definition of any search engine, including Google.

Foundem also alleges that Google’s prominent placement of its own products such as Maps and YouTube videos gives it an unfair advantage.

For its part, Google portrays Foundem as a low-value site that deserved its low rankings.

More and more challenges of this type are sure to add fuel to the U.S. investigation of Google’s alleged methods. We’re intrigued to see how it all plays out; stay tuned. I am not sure Foundem is going to get lost. Just a hunch.

Stephen E Arnold, July 12, 2011

From the leader in next-generation analysis of search and content processing, Beyond Search.

Microsoft and Its Different Search Systems

July 12, 2011

We noted SharePoint Geek’s useful blog post “Comparing SharePoint 2010 Search: Foundation vs Server vs FAST.” The table, put together by the editor at LearningSharePoint.com, presents the three main search solutions available from Microsoft in a very succinct manner. In fact, it is something we suggest you tuck into your SharePoint reference folder.

Let’s take a quick look at the three search systems available from Microsoft. Search Technologies has significant experience with each of these within our Microsoft Search Practice, and we find them useful within the design and configuration constraints which Microsoft’s engineers have defined for each system.

SharePoint Foundation 2010 is what we call “basic key word search.” The product is included with SharePoint 2010. It does a solid job of indexing content within a properly configured SharePoint installation. If you are a small business with two or three people who need access to shared content, SharePoint Foundation is going to be a logical choice.

The upgrade is the search function in SharePoint Server 2010. In a nutshell, the basic key word search and intranet indexing is similar to that in Foundation. Additional features provided with this Microsoft search system include:

  • An entity search which is optimized for people
  • A query federation function which allows content from different intranet sources to be combined in one results list
  • A graphical administrative interface.

A basic “suggested search” or “see also” function is available as well. This search system may meet the needs of most small businesses. If you need to access external content, you will want to upgrade to the Fast Search system.

The features of the Fast solution include:

  • Basic search
  • A document preview function so the application does not have to be launched to view the content
  • Intranet indexing
  • Indexing of Web and third party content not within the licensee’s SharePoint repository
  • Concatenated results lists; that is, information from multiple collections and sources
  • A graphical administrative tool
  • Faceted search (see the sketch after this list).
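
Here is that sketch: a generic, hedged illustration in Python of what “concatenated results lists” and facet counts amount to. The sources, field names, and scores are invented, and this is not the SharePoint or Fast Search API; real federation also has to normalize scores coming from different engines.

    # Generic sketch: merge hits from two sources into one ranked list and tally facets.
    # Sources, fields, and scores are invented; this is not the SharePoint/FAST API.
    from collections import Counter

    sharepoint_hits = [
        {"title": "Q2 sales deck", "score": 0.91, "source": "SharePoint", "filetype": "pptx"},
        {"title": "Travel policy", "score": 0.67, "source": "SharePoint", "filetype": "docx"},
    ]
    web_hits = [
        {"title": "Partner press release", "score": 0.84, "source": "Web", "filetype": "html"},
    ]

    def federate(*result_sets):
        """Concatenate the result sets and rank every hit by its score."""
        merged = [hit for result_set in result_sets for hit in result_set]
        return sorted(merged, key=lambda hit: hit["score"], reverse=True)

    results = federate(sharepoint_hits, web_hits)
    facets = Counter(hit["filetype"] for hit in results)

    for hit in results:
        print(f'{hit["score"]:.2f}  {hit["source"]:<10}  {hit["title"]}')
    print("File type facets:", dict(facets))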

Our view is that if you implement SharePoint Server 2010 or Fast Search Server 2010, you may want to enlist the support of a company like Search Technologies. There are three reasons:

First, engineers working on SharePoint without deep experience in search will benefit from the expertise obtained through dozens and dozens of SharePoint Search and Fast Search deployments.

Second, the optimization techniques that a firm such as Search Technologies possesses often allow a SharePoint licensee to maximize performance without the need to scale up and out.

Third, the customization functions are rich; however, some of the methods for fine tuning certain features often require custom scripting or the use of methods not required for SQL Server or Exchange. Therefore, a third party can reduce the time, cost, and frustration of adding the final touches to a SharePoint “findability” solution.

Please, navigate to www.searchtechnologies.com to learn more about our expertise in deploying Microsoft’s search solutions.

Iain Fletcher, July 12, 2011

Search Technologies

Google Lobbyist Factoid

July 6, 2011

Short honk: Not sure if this factoid is on the money, but I found it interesting. Navigate to “Google Trebles Its Force of Lobbyists Ahead of FTC Probe.” According to the write up, Google now has 18 lobbying firms assisting the world’s biggest brute force indexing system to avoid hassles with the US Federal Trade Commission. I think that 18 is an interesting number. It is the average age of Math Club members who engage in arguments about the best way to perform mental arithmetic. Also, 18 is the number of lobbying firms required to explain that Google’s “life” is “one click away” from extinction at the hands of such outfits as Blekko.com, IceRocket.com, Ixquick.com, and, of course, Bing.com. Most companies need 18 lobbying firms. Modest investment.

Stephen E Arnold, July 6, 2011

The addled goose is the author of The New Landscape of Enterprise Search, and it is only $20. Such a deal.

MeauxSoft: Free Search Tool for Your Hard Drive

July 4, 2011

“Mo-Search Puts Your Computer’s Data at Your Fingertips,” offers MeauxSoft. Version 4.0.4 of Mo-Search was just released. It and earlier versions can be downloaded from this page. Though the downloads are free, MeauxSoft suggests a donation if you find the tool useful.

Regarding the new version, the write-up lists the advances:

Supports Windows XP and later. Changes include: low overhead AutoIndex, AutoUpdate, new database engine (SQLCE), faster indexing and searching, plus many other bug fixes, optimizations and improvements.

The company boasts that Mo-Search is free of spyware and adware. That’s not a given?

It’s easy to use, providing results that are ranked and sorted. Unlike competitors’ free products, this application allows searching networked drives. Important, that.

Ease of use is enhanced with a file viewer that highlights matches without launching a separate app. Other features such as quick viewing of a file within the application, a find duplicates function, and a point and click interface are useful touches.

Stephen E Arnold, July 4, 2011

Sponsored by Pandia.com, publishers of The New Landscape of Enterprise Search

Five Reasons Why SEO Is Going to Lead to Buying Traffic

June 24, 2011

This week I have engaged in five separate conversations with super-bright 30-somethings. The one theme that made these conversations like a five-act Shakespearean comedy was SEO, or search engine optimization. The focus is on getting traffic, not building a brand or contributing to a higher value conversation.

Google continues to entertain search circus goers with its trained Pandas. These Pandas do some interesting things; for example, the gentle mouthing of the word “panda” causes heart palpitations among the marketers whose jobs depend on boosting Web traffic. Let’s face it. Most Web sites don’t get much traffic. One company which I am reluctant to name was excited to tell me it had 800 unique visitors in May 2011.


Move the world? Maybe. Move a nail salon’s Web traffic? Probably a tough job.

Okay. No problem if the 800 visitors were the global market for the firm’s product. But the 800 included robots, employees, consultants, and the occasional person looking for this firm’s specific type of archiving software.

With Web site costs creeping upwards, bean counters want to know what the money is delivering. The answer in many cases is, “More costs.”

Not good news for expensive, essentially unvisited Web sites. The painful fact of life is that among the billions of Web pages, micro sites, blogs, and whatever else has a URL, most get lousy traffic.

Archimedes, by way of Yale, said, “Give me a lever big enough and I’ll move the world.” The world? Maybe. Traffic to a vacuum cleaner repair shop in Prospect, Kentucky? Not a chance.

Pumping up traffic to a tire store or a nail salon or even a whizzy Internet marketing company is a tough job. I gave up on traffic after we did The Point (Top 5% of the Internet) right before we sold the property to CMGI. What the heck was traffic? What could or should one count? Robots? Inadvertent clicks?

That experience contributed to my skepticism about reports about how many visitors a site has.

“Google Quietly Launches Panda Update Version 2.2” is a good write up about the fearsome Panda. Like A Nightmare on Elm Street, the Panda keeps on coming, wrecking weekends for traffic-crazed marketers. Bummer. I learned:

Supposedly, one thing Google was going to address with Panda 2.2 is the issue of scraper sites – websites that republish other people’s content on their own site, usually making money from Google AdSense in the process – outranking content originators. As Frank Watson noted, "Google created the mechanism that clogs its own data centers and overwhelms its own spam battlers."

Ah, Google as both the prime mover and its own nemesis.

Now the five reasons:

  1. Google will offer sites a way to get traffic. Buy more Adwords. Simple.
  2. Traditional Web sites are not the preferred way to get information in some demographic segments; for example, those under 20.
  3. Social networks are not only better than results lists; social networks are curated. Selection is better than relevance determined by tricks.
  4. Content is proliferating, so brute force indexes are having to take short cuts to generate outputs. Those outputs are becoming less and less useful because other methods of finding are fresher and more likely to be on target.
  5. Users don’t know or care about the provenance of certain types of content. Accuracy? Who has time to double- or triple-check? Uncurated results can be spoofed.

A tip of the hat to the SEO experts. Most of the relevance problems in the major brute force indexes are directly attributable to both the indexing companies and the SEO professionals.

So what about the users? Eureka. Ask one’s social network, Facebook.

Stephen E Arnold, June 23, 2011

Sponsored by ArnoldIT.com, the resource for enterprise search information and current news about data fusion

Will Google Do Real News?

June 23, 2011

Will Google do “real” news? I read “Salon CEO Gingras Resigns to Become Global Head of News Products at Google.” I think this is a fascinating action on the part of Google. In Google: The Digital Gutenberg, published by Infonortics Ltd. in 2009, I looked at Google’s content technology. My focus was not on indexing. I reviewed the parse, tag, chop, and reassemble systems and methods that Google’s wizards had invented. The monograph is available at this link. The monograph may be useful for anyone who wants to understand what happens when “real” journalists get access to the goodies in the Googleplex. In addition to Odwalla beverages, the Google open source documents suggest that snippets of text and facts can be automatically assembled into outputs that one could describe as “reports” or “new information objects.” Sure, a human is needed in some of these processes, but Google uses lots of humans. Its public relations machine and liberal mouse pad distribution policy help keep alive the myth that Google is all math all the time. Not exactly accurate.

The write up says:

The new position as the senior executive overseeing Google News, as well as other products that may be in the pipeline, comes several years after Gingras worked as a consultant at the Mountain View campus, focusing on ways the search giant could improve its news products.

What will come from a “real” journalist getting a chance to learn about some of the auto assembly technology? I offer some ideas in my Digital Gutenberg monograph. Publishers may want to ponder this idea as well. Google is more than search, and we are going to learn more about its intentions in the near future.

Five years ago, when Google was at the top of its game, I would have had little hesitation to give Google a better than 50 percent chance of success. Now with the Amazon, Apple, and Facebook environment, I am not so sure. Google has been relying more on buying stuff that works and playing a hard game of “Me Too.”

With the most recent reworking of Google News, I find myself turning to Pulse, Yahoo News, and NewsNow.co.uk. Am I alone?

Stephen E Arnold, June 23, 2011

From the leader in next-generation analysis of search and content processing, Beyond Search.

ProQuest: A Typo or Marketing?

June 10, 2011

I was poking around with the bound phrase “deep indexing.” I had a briefing from a start-up called Correlation Concepts. The conversation focused on the firm’s method of figuring out relationships among concepts within text documents. If you want to know more about Correlation Concepts, visit the firm’s Web site at http://goo.gl/gnBz6.

I mentioned to Correlation Concepts the work of Dr. Zbigniew Michalewicz in mereology and genetic algorithms and also referenced the deep extraction methods developed by Dr. David Bean at Attensity. I also commented on some of the methods disclosed in Google’s open source content. But Google has become less interesting to me as new approaches have surfaced. Deep extraction requires focus, and I find it difficult to reconcile focus with the paint gun approach Google is now taking in disciplines far removed from my narrow area of interest.


A typo is a typo. An intentional mistake may be a joke or maybe disinformation. Source: http://thiiran-muru-arul.blogspot.com/2010/11/dealing-with-mistakes.html

After the interesting demo given to me by Correlation Concepts, I did some patent surfing. I use a number of tools to find, crunch, and figure out which crazily worded filing relates to other, equally crazily worded documents. I don’t think the patent system is much more than an exotic work of fiction and fancy similar to Spenser’s The Faerie Queene.

Deep indexing is important. Key word indexing does not, in some cases, capture the “aboutness” of a document. As metadata becomes more important, indexing outfits have to cut costs. Human indexers are like tall grass in an upscale subdivision. Someone is going to trim that surplus. In indexing, humans get pushed out for fancy automated systems. Though initially more expensive than humans, the automated systems don’t require retirement, health care, or much management. The problem is that humans still index certain content better than automated systems. Toss out high quality indexing and insert algorithmic methods, and you get search results which can vary from indexing update to indexing update.

Read more

Interview: Forensic Logic CTO, Ronald Mayer

May 20, 2011

Introduction

Ronald Mayer has spent his career with technology start-ups in a number of fields ranging from medical devices to digital video to law enforcement software. Ron has also been involved in Open Source for decades, with code that has been incorporated in the LAME MP3 library, the PostgreSQL database, and the PostGIS geospatial extension. His most recent speaking engagement was a presentation on a broader aspect of Forensic Logic’s system to the SD Forum’s Emerging Tech SIG titled “Fighting Crime: Information Choke Points & New Software Solutions.” His Lucene Revolution talk is at http://lucenerevolution.org/2011/sessions-day-2#highly-mayer.


Ronald Mayer, Forensic Logic

The Interview

When did you become interested in text and content processing?

I’ve been involved in crime analysis with Forensic Logic for the past eight years. It quickly became apparent that while a lot of law enforcement information is kept in structured database fields, the richer information is often in text narratives, Word documents on officers’ desktops, or internal email lists. Police officers are all too familiar with the long structured search forms for looking stuff up in systems built on top of relational databases. There are adequate text-search utilities for searching the narratives in their various systems one at a time, and separate text-search utilities for searching their mailing lists. But what they really need is something as simple as Google that works well on all the information they’re interested in: both their structured and unstructured content, and both their internal documents and ones from other sources. So we set out to build one.

What is it about Lucene/Solr that most interests you, particularly as it relates to some of the unique complexity law enforcement search poses?

The flexibility of Lucene and Solr is what really attracted me to Solr. There are many factors that contribute to how relevant a search is to a law enforcement user. Obviously traditional text-search factors like keyword density and exact phrase matches matter. How long ago an incident occurred is important (a recent similar crime is more interesting than a long-ago similar crime). And location is important too. Most police officers are likely to be more interested in crimes that happen in their jurisdiction or neighboring ones. However, a state agent focused on alcoholic beverage licenses may want to search for incidents from anywhere in a state but may be most interested in ones that are at or near bars. The quality of the data makes things interesting too. Victims often have vague descriptions of offenders, and suspects lie. We try to program our system so that a search for “a tall thin teen male” will match an incident mentioning “a 6’3″, 150 lb., 17-year-old boy.” There’s been a steady emergence of information technology in law enforcement, such as in New York City’s CompStat.
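
Mr. Mayer does not disclose Forensic Logic’s actual configuration, but the blend he describes (text relevance plus recency plus proximity to the searcher) can be approximated with stock Solr function queries: recip(ms(NOW,date_field),...) decays scores with age, and geodist() rewards nearby incidents. The Python sketch below is a hedged illustration; the host, core, field names, coordinates, and boost constants are all invented.

    # Hedged sketch: a Solr edismax query blending text relevance, recency, and distance.
    # Host, core ("incidents"), field names, and boost constants are illustrative only.
    import json
    import urllib.parse
    import urllib.request

    params = {
        "q": "tall thin teen male",
        "defType": "edismax",
        "qf": "narrative",                                  # text relevance on the narrative field
        "bf": "recip(ms(NOW,incident_date),3.16e-11,1,1)",  # additive boost: newer incidents score higher
        "boost": "recip(geodist(),2,200,20)",               # multiplicative boost: nearby incidents score higher
        "sfield": "location",                               # lat,lon field used by geodist()
        "pt": "37.80,-122.27",                              # the searching officer's position
        "fl": "id,score",
        "rows": "10",
        "wt": "json",
    }

    url = "http://localhost:8983/solr/incidents/select?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        for doc in json.load(response)["response"]["docs"]:
            print(doc["id"], doc["score"])

Tuning the decay and distance constants against real queries is where most of the work would live; the sketch only shows where those knobs sit.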

What are the major issues in this realm, from an information retrieval processing perspective?

We’ve had meetings with the NYPD’s CompStat group, and they have inspired a number of features in our software, including powering the CompStat reports for some of our customers. One of the biggest issues in law enforcement data today is bringing together data from different sources and making sense of it. These sources could be different systems within a single agency, like records management and CAD (computer-aided dispatch) systems and internal agency email lists, or groups of cities sharing data with each other, or federal agencies sharing data with state and local agencies.

Is this a matter of finding new information of interest in law enforcement and security? Or is it about integrating the information that’s already there? Put differently, is it about connecting the dots you already have, or finding new dots in new places?

Both. Much of the work we’re doing is connecting dots between data from two different agencies, or two different software systems within a single agency. But we’re also indexing a number of non-obvious sources as well. One interesting example is a person who was recently found in our software; one of the better documents describing a gang he’s potentially associated with is a Wikipedia page about one of his relatives.

You’ve contributed to Lucene/Solr. How has the community aspect of open source helped you do your job better, and how do you think it has helped other people as well?

It’s a bit early to say I’ve contributed; while I posted my patch to their issue tracking Web site, last I checked it hadn’t been integrated yet. There are a couple of users who mentioned to me and on the mailing lists that they are using it and would like to see it merged. The community help has been incredible. One example is when we started a project to make a minimal, simple user interface to let novice users find agency documents. We noticed Project Blacklight, from the University of Virginia, Stanford, and others, which is a beautiful library search product built on Solr/Lucene. Our needs for one of our products weren’t too different: just an internal collection of documents with a few additional facets. With that as a starting point we had a working prototype in a few man-days of work, and a product in a few months.

What are some new or different uses you would like to see evolve within search?

It’d be interesting if search phrases could be aware of which adjectives go with which nouns. For example, a phrase like

‘a tall white male with brown hair and blue eyes and
a short asian female with black hair and brown eyes’

should be a very close match to a document that says

‘blue eyed brown haired tall white male; brown eyed
black haired short asian female’

Solr’s edismax “pf2” and “pf3” parameters can do quite a good job at this by considering the distance between words, but note that in the latter document the “brown eyes” clause is nearer to the male than the female, so there’s some room for improvement. I’d like to see some improved spatial features as well. Right now we use a single location in a document to help sort how relevant it might be to a user (incidents close to a user’s agency are often more interesting than ones halfway across the country). But some documents may be highly relevant in multiple different locations, like a drug trafficking ring operating between Dallas and Oakland.
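
For readers who have not met them, pf2 and pf3 are edismax parameters that boost documents in which adjacent pairs or triples of the query’s words occur together as phrases. A minimal, hedged sketch of the request parameters follows; the “narrative” field and the core name are invented.

    # Hedged sketch: edismax phrase-proximity boosting with pf2/pf3.
    # The "narrative" field and the core name are invented for illustration.
    import urllib.parse

    params = {
        "q": "tall white male brown hair blue eyes",
        "defType": "edismax",
        "qf": "narrative",        # fields scored for the individual terms
        "pf2": "narrative^5",     # boost docs where adjacent word pairs appear as phrases
        "pf3": "narrative^10",    # stronger boost for adjacent word triples
    }

    print("http://localhost:8983/solr/incidents/select?" + urllib.parse.urlencode(params))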

When someone asks you why you don’t use a commercial search solution, what do you tell them?

I tell them that, where appropriate, we also use commercial search solutions. For our analysis and reporting product that works mostly with structured data, we use a commercial text search solution because it integrates well with the relational tables that also filter results for such reporting. The place where Solr/Lucene’s flexibility really shines for us is in our product that brings structured, semi-structured, and totally unstructured data together.

What are the benefits to a commercial organization or a government agency when working with your firm? How does an engagement for Forensic Logic move through its life cycle?

Our software is used to power the Law Enforcement Analysis Portal (LEAP) project, which is a software-as-a-service platform for law enforcement tools, not unlike what Salesforce.com is for sales software. The project started in Texas and has recently expanded to include agencies from other states and the federal government. Rather than engaging us directly, a government agency would engage with the LEAP Advisory Board, which is a group of chiefs of police, sheriffs, and state and federal law enforcement officials. We provide some of the domain-specific software, while other partners such as Sungard manage some operations and other software and hardware vendors provide their support. The benefits of government agencies working with us are similar to the benefits of an enterprise working with Salesforce.com: leading edge tools without having to buy expensive equipment and software and manage it internally.

One challenge to those involved with squeezing useful elements from large volumes of content is the volume of content and the rate of change in existing content objects. What does your firm provide to customers to help them deal with the volume (scaling) challenge? What is the latency for index updates? Can law enforcement and public security agencies use this technology to deal with updates from high-throughput sources like Twitter? Or is the signal-to-noise ratio too weak to make it worth the effort?

In most cases, when a record is updated in an agency’s records management system, the change is pushed to our system within a few minutes. For some agencies, mostly those with older mainframe-based systems, the integration is a nightly batch job. We don’t yet handle high-throughput sources like Twitter. License plate readers on freeways are probably the highest throughput data source we’re integrating today. But we strongly believe it is worth the effort to handle high-throughput sources like Twitter, and that it’s our software’s job to deal with the signal-to-noise challenges you mentioned and to present more signal than noise to the end user.
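
The push-style integration he describes maps naturally onto Solr’s update handler followed by a commit. The sketch below is generic and hedged: the endpoint, core, and fields are invented, and a real integration would batch documents and let the server schedule commits rather than committing per record.

    # Generic sketch: push one changed record into a Solr index, then commit.
    # Endpoint, core, and fields are invented; real jobs batch updates.
    import urllib.request

    doc_xml = """<add>
      <doc>
        <field name="id">incident-2011-0457</field>
        <field name="agency">Example PD</field>
        <field name="narrative">Suspect described as a tall thin teen male.</field>
      </doc>
    </add>"""

    def post(url, body):
        request = urllib.request.Request(
            url,
            data=body.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"},
        )
        urllib.request.urlopen(request).read()

    post("http://localhost:8983/solr/incidents/update", doc_xml)
    post("http://localhost:8983/solr/incidents/update", "<commit/>")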

Visualization has been a great addition to briefings. On the other hand, visualization and other graphic eye candy can be a problem for those in stressful operational situations. What’s your firm’s approach to presenting “outputs” for end user reuse or for mobile access? Is there native support in Lucid Imagination for results formats?

Visualization is very important to law enforcement, with crime mapping and reporting being very common needs. We have a number of visualization tools like interactive crime maps, heat maps, charts, time lines, and link diagrams built into our software, and we also expose XML Web services to let our customers integrate their own visualization tools. Some of our products were designed with mobile access in mind. Others have such complex user interfaces that you really want a keyboard.

There seems to be a popular perception that the world will be doing computing via iPad devices and mobile phones. My concern is that serious computing infrastructures are needed and that users are “cut off” from access to more robust systems. How do you see the computing world over the next 12 to 18 months?

I think the move to mobile devices is *especially* true in law enforcement. For decades most officers have “searched” their systems by using the radio they carry to verbally ask for information about people and property. It’s a natural transition for them to do this on a phone or iPad instead. Similarly, their data entry is often done first on paper in the field, and then re-entered into computers. One agency we work with will be getting iPads for each of their officers to replace both of those. We agree that serious computing infrastructures are needed, but our customers don’t want to manage those themselves. Better if a SaaS vendor manages a robust system, and what better devices than iPads and phones to access it? That said, for some kinds of analysis a powerful workstation is useful, so good SaaS vendors will provide Web services so customers can pull whatever data they need into their other applications.

Put on your wizard hat. What are the three most significant technologies that you see affecting your search business? How will your company respond?

Entity extraction from text documents is improving all the time, so soon we’ll be able to distinguish whether a paragraph mentioning “Tom Green” is talking about a person or the county in Texas. For certain types of data we integrate, XML standards for information sharing such as the National Information Exchange Model are finally gaining momentum. As more software vendors support it, it will be easier to interoperate with other systems. Rich-media processing, such as facial recognition, license plate reading, and OCR, is making new media types searchable and analyzable as well.
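
As a point of reference, basic entity extraction can be prototyped with off-the-shelf open source tools. The fragment below uses NLTK’s stock chunker purely as an illustration; it is not Forensic Logic’s pipeline, and a naive tagger will not by itself separate “Tom Green” the person from Tom Green County. That still takes surrounding context or a gazetteer.

    # Illustration only: off-the-shelf entity extraction with NLTK's stock chunker.
    # Requires "pip install nltk" plus the punkt, averaged_perceptron_tagger,
    # maxent_ne_chunker, and words data packages.
    import nltk

    def extract_entities(text):
        tokens = nltk.word_tokenize(text)
        tagged = nltk.pos_tag(tokens)
        for subtree in nltk.ne_chunk(tagged):
            if hasattr(subtree, "label"):   # named-entity subtrees carry a label
                entity = " ".join(word for word, tag in subtree.leaves())
                print(subtree.label(), "->", entity)

    extract_entities("Officers interviewed Tom Green near San Angelo in Tom Green County, Texas.")
    # A stock chunker tags both "Tom Green" mentions the same way; telling the person
    # from the county takes additional context, metadata, or a gazetteer.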

I note that you’re speaking at the Lucene Revolution conference. What effect is open source search having in your space? I note that the term ‘open source intelligence’ doesn’t really overlap with ‘open source software’. What do you think the public sector can learn from the world of open source search applications, and vice versa?

Many of the better tools are open source tools. In addition to Lucene/Solr, I’d note that the PostGIS extension to the PostgreSQL database is leading the commercial implementations of geospatial tools in some ways. That said, there are excellent commercial tools too. We’re not fanatic either way. Open source intelligence is important as well, and we’re working with universities to bring some of the collected research that they do on organized crime and gangs into our system. Regarding learning experiences? I think the big lesson is that easy collaboration is a very powerful tool, whether it’s sharing source code or sharing documents and data.

Lucene/Solr seems to have matured significantly in recent years, achieving a following large and sophisticated enough to merit a national conference dedicated to the open source projects, Lucene Revolution. What advice do you have for people who are interested in adopting open source search, but don’t know where to begin?

If they’re interested, one of the easiest ways to begin is to just try it. On Linux you can probably install it with your OS’s standard package manager with a command like “apt-get install solr-jetty” or similar. If they have a particular need in mind, they might want to look to see whether someone has already built a Lucene/Solr powered application similar to what they need. For example, we wanted a searchable index for a set of publications and documents, and Project Blacklight gave us a huge head start.

David Fishman, May 20, 2011

Post sponsored by Lucid Imagination. Posted by Stephen E Arnold
