dtSearch Chases Those Pesky PDFs

September 7, 2015

Predictive analytics and other litigation software are more important than ever for legal professionals who must sift through mounds of documents and discover patterns, and several companies have come to the rescue, especially dtSearch.  Inside Counsel explains how a “New dtSearch Release Offers More Support To Lawyers.”

The latest dtSearch release is not only able to search through terabytes of information in online and offline environments, but its document filters have broadened to search encrypted PDFs, including those protected with a password.  While PDFs are a universally accepted document format, they are a pain to deal with if they ever have to be edited or are password protected.

Also included in the dtSearch release are other beneficial features:

“Additionally, dtSearch products can parse, index, search, display with highlighted hits, and extract content from full-text and metadata in several data types, including: Web-ready content; other databases; MS Office formats; other “Office” formats; PDF; compression formats; emails and attachments; recursively embedded objects; Terabyte Indexer; and concurrent, multithreaded searching.”

The new PDF search feature, with its ability to delve into encrypted PDF files, is a huge leap ahead of its rivals.  Being able to explore PDFs without Adobe Acrobat or another PDF editor will make working through litigation much simpler.
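The article does not show dtSearch’s own interface for this, but the underlying task (pulling searchable text out of a password-protected PDF) can be sketched in a few lines of Python with the open-source pypdf library; the file name and password below are placeholders, and a product like dtSearch presumably does this extraction at far larger scale inside its own document filters.

from pypdf import PdfReader

reader = PdfReader("filing.pdf")          # placeholder file name
if reader.is_encrypted:
    reader.decrypt("secret-password")     # placeholder password
# Concatenate extracted text so it can be handed to an indexer.
text = "\n".join(page.extract_text() or "" for page in reader.pages)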

Whitney Grace, September 7, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Shades of CrossZ: Compress Data to Speed Search

September 3, 2015

I have mentioned in my lectures a start-up called CrossZ. Before whipping out your smartphone and running a predictive query on the Alphabet GOOG thing, sit tight.

CrossZ hit my radar in 1997. The concept behind the company was to compress extracted chunks of data. The method, as I recall, made use of fractal compression, which was the rage at that time. The queries were converted to fractal tokens. The system then quickly pulled out the needed data and displayed them in human readable form. The approach was called, as I recall, “QueryObject.” By 2002, the outfit dropped off my radar. The downside of the CrossZ approach was that the compression was asymmetric; that is, preparing the fractal chunk was slow, but running a query and extracting the needed data was really fast.

Flash forward to Terbium Labs, which has a patent on a method of converting data to tokens or what the firm calls “digital fingerprints.” The system matches patterns and displays high probability matches. Terbium is a high potential outfit. The firm’s methods may be a short cut for some of the Big Data matching tasks some folks in the biology lab have.

For me, the concept of reducing the size of a content chunk and then querying it to achieve faster response time is a good idea.
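Neither CrossZ’s fractal method nor Terbium’s patented fingerprinting is public in detail, but the general pattern (reduce content to compact tokens, then match queries against the tokens instead of the raw text) can be sketched in a few lines of Python. This is a toy illustration, not either firm’s algorithm.

import hashlib
from collections import defaultdict

def fingerprints(text, n=5):
    # Reduce text to a set of short hashes over overlapping word shingles.
    words = text.lower().split()
    shingles = [" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))]
    return {hashlib.sha1(s.encode()).hexdigest()[:16] for s in shingles}

# Index documents by their compact fingerprints rather than by full text.
index = defaultdict(set)
documents = {
    "doc1": "fractal compression was the rage in the late nineties",
    "doc2": "digital fingerprints allow matching without exposing raw data",
}
for doc_id, text in documents.items():
    for fp in fingerprints(text):
        index[fp].add(doc_id)

def query(text):
    # A query is fingerprinted the same way and matched against the index.
    hits = set()
    for fp in fingerprints(text):
        hits |= index.get(fp, set())
    return hits

print(query("matching without exposing raw data"))  # -> {'doc2'}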

What do you think I thought when I read “Searching Big Data Faster”? Three notions flitted through my aged mind:

First, the idea is neither new nor revolutionary. Perhaps the MIT implementation is novel? Maybe not?

Second, the main point that “evolution is stingy with good designs” strikes me as a wild and crazy generalization. What about the genome of the octopus, gentle reader?

Third, MIT is darned eager to polish the MIT apple. This is okay as long as the whiz kids take a look at companies which used this method a couple of decades ago.

That is probably not important to anyone but me and those who came up with the original idea, maybe before CrossZ popped out of Eastern Europe and closed a deal with a large financial services firm years ago.

Stephen E Arnold, September 3, 2015

Dark Web Drug Trade Unfazed by Law Enforcement Crackdowns

September 3, 2015

When Silk Road was taken down in 2013, the Dark Web took a big hit, but it was only a few months before black marketers found alternate means to sell their wares, including illegal drugs.  The Dark Web provides an anonymous and often secure means to purchase everything from heroin to prescription narcotics with, apparently, few worries about the threat of prosecution.  Wired explains that “Crackdowns Haven’t Stopped The Dark Web’s $100M Yearly Drug Sale,” proving that if there is a demand, the Internet will provide a means for illegal sales.

In an effort to determine whether the Dark Web has grown or declined, Carnegie Mellon researchers Nicolas Christin and Kyle Soska studied thirty-five Dark Web markets from 2013 to January 2015.  They discovered that the Dark Web markets are no longer growing explosively, but the market has remained stable, fluctuating between $100 million and $180 million a year.

The researchers concluded that the Dark Web market is able to survive any “economic” shifts, including law enforcement crackdowns:

“More surprising, perhaps, is that the Dark Web economy roughly maintains that sales volume even after major disasters like thefts, scams, takedowns, and arrests. According to the Carnegie Mellon data, the market quickly recovered after the Silk Road 2 market lost millions of dollars of users’ bitcoins in an apparent hack or theft. Even law enforcement operations that remove entire marketplaces, as in last year’s purge of half a dozen sites in the Europol/FBI investigation known as Operation Onymous, haven’t dropped the market under $100 million in sales per year.”

Christin and Soska’s study is the most comprehensive effort yet to measure the size and trajectory of the Dark Web’s drug market.  Their study ended prematurely because two sites grew so big that the researchers’ software was not able to track the content.  Their data showed that most Dark Web vendors are increasingly using encryption tools, make profits of less than $1,000, and mostly sell MDMA and marijuana.

Soska and Christin also argue that the Dark Web drug trade decreases violence in the retail drug trade; that is, it keeps transactions digital rather than adding more violence on the streets.  They urge law enforcement officials to rethink shutting down Dark Web markets, because doing so does not seem to have any effect.

Whitney Grace, September 3, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Does This Autonomous Nerf Gun Herald the Age of Killer Robots?

September 3, 2015

Well, here’s something interesting that has arisen from HP’s “disastrous” $11 billion acquisition of Autonomy: check out this three-minute YouTube video: “See What You Can Create with HP IDOL OnDemand.” The fascinating footage reveals the product of developer Martin Zerbib’s “little project,” made possible with IDOL OnDemand and a Nerf gun. Watch as the system targets a specific individual, a greedy pizza grabber, a napping worker, and a thief. It seems like harmless fun, until you realize how gruesome this footage would be if this were a real gun.

It is my opinion that it is the wielders of weapons who should be held directly responsible for their misuse, not the inventors. Still, commenter “Dazed Confused” has a point when he rhetorically asks “What could possibly go wrong?” and links to an article in Bulletin of the Atomic Scientists, “Stopping Killer Robots and Other Future Threats.” That piece describes an agreement being hammered out that proposes to ban the development of fully autonomous weapons. Writer Seth Baum explains there is precedent for such an agreement: The Saint Petersburg Declaration of 1868 banned exploding bullets, and 105 countries have now ratified the 1995 Protocol on Blinding Laser Weapons. (Such laser weapons could inflict permanent blindness on soldiers, it is reasoned.) After conceding that auto-weaponry would have certain advantages, the article points out:

“But the potential downsides are significant. Militaries might kill more if no individual has to bear the emotional burden of strike decisions. Governments might wage more wars if the cost to their soldiers were lower. Oppressive tyrants could turn fully autonomous weapons on their own people when human soldiers refused to obey. And the machines could malfunction—as all machines sometimes do—killing friend and foe alike.

“Robots, moreover, could struggle to recognize unacceptable targets such as civilians and wounded combatants. The sort of advanced pattern recognition required to distinguish one person from another is relatively easy for humans, but difficult to program in a machine. Computers have outperformed humans in things like multiplication for a very long time, but despite great effort, their capacity for face and voice recognition remains crude. Technology would have to overcome this problem in order for robots to avoid killing the wrong people.”

Baum goes on to note that organizers base their call for a ban on existing international humanitarian law, which prohibits weapons that would strike civilians. Such reasoning has already been employed to achieve bans against landmines and cluster munitions, and is being leveraged in an attempt to ban nuclear weapons.

Will killer robots be banned before they’re a reality? It seems the agreement would have to move much faster than bureaucracy usually does; given the public example of Zerbib’s “little project,” I suspect it is already way too late for that.

Cynthia Murrell, September 3, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Watson Speaks Naturally

September 3, 2015

While there are many companies that offer accurate natural language comprehension software, completely understanding the complexities of human language still eludes computers.  IBM reports that it is close to overcoming the natural language barrier with IBM Watson Content Analytics, as described in “Discover And Use Real-World Terminology With IBM Watson Content Analytics.”

The tutorial points out that any analytics program that relies only on structured data loses about four fifths of the available information, which is a big disadvantage in the big data era, especially when insights are supposed to be hidden in the unstructured data.  Watson Content Analytics is a search and analytics platform that uses rich-text analysis to find and extract actionable insights from sources such as email, social media, Web content, and databases.

Watson Content Analytics can be used in two ways:

  • “Immediately use WCA analytics views to derive quick insights from sizeable collections of contents. These views often operate on facets. Facets are significant aspects of the documents that are derived from either metadata that is already structured (for example, date, author, tags) or from concepts that are extracted from textual content.
  • Extracting entities or concepts, for use by WCA analytics view or other downstream solutions. Typical examples include mining physician or lab analysis reports to populate patient records, extracting named entities and relationships to feed investigation software, or defining a typology of sentiments that are expressed on social networks to improve statistical analysis of consumer behavior.”

The tutorial runs through a domain-specific terminology application for Watson Content Analytics.  The application gets fairly involved, but it shows how Watson Content Analytics may go beyond the typical big data application.
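IBM’s WCA API is not reproduced here, but the second use case quoted above, extracting entities from unstructured text and grouping them into facets for downstream use, can be approximated with an open-source stand-in such as spaCy. A minimal sketch, with a made-up sample sentence:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

report = "Dr. Alice Smith reviewed the lab results for patient John Doe on March 3."
doc = nlp(report)

# Group extracted entities into facets (label -> values), loosely analogous
# to the facets WCA derives from textual content.
facets = {}
for ent in doc.ents:
    facets.setdefault(ent.label_, []).append(ent.text)

print(facets)  # e.g. {'PERSON': ['Alice Smith', 'John Doe'], 'DATE': ['March 3']}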

Whitney Grace, September 3, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Suggestions for Developers to Improve Functionality for Search

September 2, 2015

The article on SiteCrafting titled Maxxcat Pro Tips lays out some guidelines for improved functionality when it comes to deep search. Limiting Your Crawls is the first suggestion. Since all links are not created equal, it is wise to avoid runaway crawls on links where there will always be a “Next” button. The article suggests hand-selecting the links you want to use. The second tip is Specify Your Snippets. The article explains,

“When MaxxCAT returns search results, each result comes with four pieces of information: url, title, meta, and snippet (a preview of some of the text found at the link). By default, MaxxCAT formulates a snippet by parsing the document, extracting content, and assembling a snippet out of that content. This works well for binary documents… but for webpages you wanted to trim out the content that is repeated on every page (e.g. navigation…) so search results are as accurate as possible.”
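That trimming can be done before the snippet is built. The following is a generic sketch, not MaxxCAT’s own configuration, using BeautifulSoup to strip the page furniture that repeats on every page; the tag names are assumptions about a typical site.

from bs4 import BeautifulSoup

def snippet_text(html, length=200):
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that repeat on every page and would pollute snippets.
    for tag in soup(["nav", "header", "footer", "script", "style"]):
        tag.decompose()
    text = " ".join(soup.get_text(separator=" ").split())
    return text[:length]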

The third suggestion is to Implement Meta-Tag Filtering. Each suggestion is followed up with step-by-step instructions. These handy tips come from a partnership between SiteCrafting, a web design company founded in 1995 by Brian Forth, and MaxxCAT, a company acknowledged for its achievements in high-performance search since 2007.

Chelsea Kerwin, September 2, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Maverick Search and Match Platform from Exorbyte

August 31, 2015

The article titled Input Management: Exorbyte Automates the Determination of Identities on Business On (a primarily German-language website) promotes Full Page Entity Detect from Exorbyte. Exorbyte is a world leader in search and match for large volumes of data. The company boasts clients in government, insurance, input management, and ICT firms, really any business with identity resolution needs. The article stresses the importance of pulling information from masses of data in the modern office. They explain,

“With Full Page Entity Detect, exorbyte provides a solution for inboxes that receive several million incoming documents. Identity data in the digitized correspondence can be extracted with little effort from full-text documents such as letters and emails and efficiently compared against reference databases. The input management tool combines high fault tolerance with accuracy, speed, and flexibility. The software company from Konstanz was recently included in Gartner’s Magic Quadrant for Enterprise Search.”

The company promises that its Matchmaker technology is unrivaled in searching text without restrictions, even independent of language, allowing for more accurate search. Full Page Entity Detect is said to be particularly useful when it comes to missing information or overlooked errors, since the search is so thorough.
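Exorbyte does not publish how Matchmaker works, but fault-tolerant matching of extracted identity data against a reference database can be illustrated with Python’s standard difflib; the names and threshold below are invented for the example.

import difflib

reference_db = ["Johann Müller", "Maria Schneider", "Hans-Peter Vogel"]

def best_match(extracted, threshold=0.8):
    # Return the closest reference entry if it is similar enough,
    # tolerating OCR noise and typos in the extracted text.
    scored = [(difflib.SequenceMatcher(None, extracted.lower(), ref.lower()).ratio(), ref)
              for ref in reference_db]
    score, ref = max(scored)
    return ref if score >= threshold else None

print(best_match("Johan Mueller"))  # close enough to "Johann Müller" to match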

Chelsea Kerwin, August 31, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Beyond Google, How to Work Your Search Engine

August 28, 2015

The article on Funnelback titled Five Ways to Improve Your Website Search offers tips that may seem obvious but could always stand to be reinforced. Sometimes the Google site:<url> trick is not enough. The first tip, for example, is simply to be helpful. That means recognizing synonyms and perhaps adding an autocomplete function in case your site users think in different terms than you do. The worst-case scenario in search is typing in a term and getting no results, especially when the problem is just language and the thing being searched for is actually present, just not found. The article goes into the importance of the personal touch as well,

“You can use more than just the user’s search term to inform the results your search engine delivers… For example, if you search for ‘open day’ on a university website, it might be more appropriate to promote and display an ‘International Open Day’ event result to prospective international students instead of your ‘Domestic Student Open Day’ counterpart event. This change in search behavior could be determined by the user’s location – even if it wasn’t part of their original search query.”

The article also suggests learning from the search engine. Obviously, analyzing what customers are most likely to search for on your website will tell you a lot about what sort of marketing is working, and what sort of customers you are attracting. Don’t underestimate search.
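Going back to that first tip, query-time synonym expansion can be sketched crudely as below; the synonym table and the terms in it are hypothetical, not Funnelback’s API.

SYNONYMS = {
    "open day": ["open house", "campus visit"],
    "fees": ["tuition", "costs"],
}

def expand_query(query):
    # Search the user's term plus any configured synonyms so that a
    # difference in wording does not produce an empty result page.
    q = query.lower().strip()
    return [q] + SYNONYMS.get(q, [])

print(expand_query("open day"))  # ['open day', 'open house', 'campus visit']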

Chelsea Kerwin, August 28, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Lexmark: Signs of Trouble?

August 27, 2015

I read “Shares of Lexmark International Inc. Sees Large Outflow of Money.”

The main point of the write up in my opinion was:

The company shares have dropped 41.65% in the past 52 Weeks. On August 25, 2014 The shares registered one year high of $50.63 and one year low was seen on August 21, 2015 at $29.11.

Today as I write this (August 26, 2015), Lexmark is trading at $28.25.

Why do I care?

The company acquired several search and content processing systems in an effort to find a replacement for its traditional business, printers. As you know, Lexmark is one of the former IBM units which had an opportunity to find its future outside of IBM.

The company purchased three vendors which were among the companies I monitored:

  • Brainware, the trigram folks
  • ISYS Search Software, the 1988 old school search and retrieval system
  • Kapow (via Lexmark’s purchase of Kofax), the data normalization outfit.

Also, the company’s headquarters are about an hour from my cabin next to the pond filled with mine runoff. Cutbacks at Lexmark may spell more mobile homes in my neck of the woods.

Stephen E Arnold, August 27, 2015

Insights into the Cut and Paste Coding Crowd

August 26, 2015

I read “How Developers Search for Code.” Interesting. The write up points out what I have observed. Programmers search for existing — wait for it — code.

Why write something when there are wonderful snippets to recycle?  Here’s the paragraph I highlighted:

We also learn that a search session is generally just one to two minutes in length and involves just one to two queries and one to two file clicks.

Yep, very researchy. Very detailed. Very shallow. Little wonder that most software rolls out in endless waves of fixes. Good enough is the sort-of-six-sigma way.

Encouraging. Now why did that air traffic control system crash happen? Where are the backups to the data in Google’s Belgium data center? Why does that wonderful Windows 10 suck down data to mobile devices with little regard for data caps? Why does malware surface in Android apps?

Good enough: the new approach to software QA/QC.

Stephen E Arnold, August 26, 2015
