Feature – Beyond Search

Why Google Dorks Exist and Why Most Users Do Not Know Why They Are Needed

Stephen E. Arnold — Mon, 04 Dec 2023 15:03:53 +0000

This essay is the work of a dumb dinobaby. No smart software required.

Many people in my lectures are not familiar with the concept of “dorks”. No, not the human variety. I am referencing the concept of a “Google dork.” If you do a quick search using Yandex.com, you will get pointers to different “Google dorks.” Click on one of the links and you will find information you can use to retrieve more precise and relevant information from the Google ad-supported Web search system.

Here’s what QDORKS.com looks like:

The idea is that one plugs in search terms and uses the pull down boxes to enter specific commands to point the ad-centric system at something more closely resembling a relevant result. Other interfaces are available; for example, the “1000 Best Google Dorks List.” You get a laundry list of tips, commands, and ideas for wrestling Googzilla to the ground, twisting its tail, and (hopefully) making it yield relevant information. Hopefully. Good work.
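For readers who have never seen a dork spelled out, here is a minimal sketch of what such a form does behind the pull down boxes. It is my illustration, not the code of QDORKS.com or any other site. The function names `build_dork` and `search_url` are made up for this example; the operators `site:`, `filetype:`, and `intitle:` are standard Google query operators, and example.com is a placeholder domain.

```python
from urllib.parse import quote_plus


def build_dork(terms, site=None, filetype=None, intitle=None, exact=False):
    """Assemble a query string from plain terms plus optional operators."""
    parts = []
    if terms:
        # Quotation marks request the exact phrase instead of loose matching.
        parts.append(f'"{terms}"' if exact else terms)
    if site:
        parts.append(f"site:{site}")          # restrict results to one domain
    if filetype:
        parts.append(f"filetype:{filetype}")  # restrict results to one file type
    if intitle:
        parts.append(f'intitle:"{intitle}"')  # require words in the page title
    return " ".join(parts)


def search_url(query):
    """URL-encode the query so it can be pasted into a browser."""
    return "https://www.google.com/search?q=" + quote_plus(query)


query = build_dork("annual report", site="example.com", filetype="pdf", exact=True)
print(query)
print(search_url(query))
```

The point of the sketch is that a dork is nothing exotic: it is a precise query the default search box does not encourage a user to write.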

Most people are lousy at pinning the tail on the relevance donkey. Therefore, let someone who knows define relevance for the happy people. Thanks, MSFT Copilot. Nice animal with map pins.

Why are Google Dorks or similar guides to Google search necessary? Here are three reasons:

Precision reduces the opportunities for displaying allegedly relevant advertising. Semantic relaxation allows the Google to suggest that it is using Oingo type methods to find mathematically determined relationships. The idea is that razzle dazzle makes ad blasting something like an ugly baby wrapped in translucent fabric on a foggy day look really great.
When Larry Page argued with me at a search engine meeting about truncation, he displayed a preconceived notion about how search should work for those not at Google or attending a specialist conference about search. Rational? To him, yep. Logical? To his framing of the search problem, the stance makes perfect sense if one discards the notion of tense, plurals, inflections, and stupid markers like “im” as in “impractical” and “non” as in “nonsense.” Hey, Larry had the answer. Live with it.
The goal at the Google is to make search as intellectually easy for the “user” as possible. The idea was to suggest what the user intended. Also, Google had the old idea that a person’s past behavior can predict that person’s behavior now. Well, predict in the sense that “good enough” will do the job for the vast majority of search-blind users who look for the short cut or the most convenient way to get information.
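The truncation point in the second reason is easier to see with a toy example. The sketch below is mine and says nothing about how Google handles word forms internally. It compares an exact word match with old-school right-hand truncation (the `index*` of the commercial online systems) over a handful of made-up strings. The names `DOCS`, `exact_match`, and `truncated_match` are hypothetical.

```python
import re

# Made-up "documents" for illustration only.
DOCS = [
    "indexing the web",
    "indexes are partial",
    "an indexed backfile",
    "index",
    "reindex everything",
]


def exact_match(term, docs):
    """Return docs containing the term as a whole word, nothing more."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return [d for d in docs if pattern.search(d)]


def truncated_match(stem, docs):
    """Right-hand truncation: the stem followed by any word characters."""
    pattern = re.compile(rf"\b{re.escape(stem)}\w*", re.IGNORECASE)
    return [d for d in docs if pattern.search(d)]


print(exact_match("index", DOCS))
print(truncated_match("index", DOCS))
```

The exact match finds only the bare word. Truncation picks up the plural, the past tense, and the gerund, but still misses “reindex” because the variation sits on the left side of the stem. That is the prefix problem with markers like “im” and “non” mentioned above.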

Why? Control, being clever, and then selling the dream of clicks for advertisers. Over the years, Google leveraged its information framing power to a position of control. I want to point out that most people, including many Googlers, cannot perceive this control. When it is pointed out, those individuals refuse to believe that Google does [a] NOT index the full universe of digital data, [b] NOT want to fool around with users who prefer Boolean algebra or with content curation to identify the best or most useful content, and [c] NOT want to fiddle around with training people to become effective searchers of online information. Obfuscation, verbal legerdemain, and the “do no evil” craziness make the railroad run the way Cornelius Vanderbilt-types implemented.

I read this morning (December 4, 2023) the Google blog post called “New Ways to Find Just What You Need on Search.” The main point of the write up in my opinion is:

Search will never be a solved problem; it continues to evolve and improve alongside our world and the web.

I agree, but it would be great if the known search and retrieval functions were available to users. Instead, we have a weird Google Mom approach. From the write up:

To help you more easily keep up with searches or topics you come back to a lot, or want to learn more about, we’re introducing the ability to follow exactly what you’re interested in.

Okay, user tracking, stored queries, and alerts. How does the Google know what you want? The answer is that users log in, use Google services, and enter queries which are automatically converted to search. You will have answers to questions you really care about.

There are other search functions available in the most recent version of Google’s attempts to deal with an unsolved problem:

As with all information on Search, our systems will look to show the most helpful, relevant and reliable information possible when you follow a topic.

Yep, Google is a helicopter parent. Mom will know what’s best, select it, and present it. Don’t like it? Mom will be recalcitrant, like shaping search results to meet what the probabilistic system says, “Take your medicine, you brat.” Who said, “Mother Google is a nice mom”? Definitely not me.

And Google will make search more social. Shades of Dr. Alon Halevy and the heirs of Orkut. The Google wants to bring people together. Social signals make sense to Google. Yep, content without Google ads must be conquered. Let’s hope the Google incentive plans encourage the behavior, or those valiant programmers will be bystanders to other Googlers’ promotions and accompanying money deliveries.

Net net: Finding relevant, on point, accurate information is more difficult today than at any other point in my 50+ year work career. How does the cloud of unknowing dissipate? I have no idea. I think it has moved in on tiny Googzilla feet and sits looking over the harbor, ready to pounce on any creature that challenges the status quo.

PS. Corny Vanderbilt was an amateur compared to the Google. He did trains; Google does information.

Stephen E Arnold, December 4, 2023

Google: Running the Same Old Game Plan

Stephen E. Arnold — Mon, 31 Jul 2023 08:20:00 +0000

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

Google has been running the same old game plan since the early 2000s. But some experts are unaware of its simplicity. In the period from 2002 to 2004, I did a number of reports for my commercial clients about Google. In 2004, I recycled some of the research and analysis into The Google Legacy. The thesis of the monograph, published in England by the now defunct Infonortics Ltd., explained that the infrastructure for search was enhanced to provide an alternative to commercial software for personal, business, and government use. The idea that a search-and-retrieval system based on precedent technology and funded in part by the National Science Foundation with a patent assigned to Stanford University could become Googzilla was a difficult idea to swallow. One of the investment banks who paid for our research got the message even though others did not. I wonder if that one group at the then world’s largest software company remembers my lecture about the threat Google posed to a certain suite of software applications? Probably not. The 20 somethings and the few suits at the lecture looked like kindergarteners waiting for recess.

I followed up The Google Legacy with Google Version 2.0: The Calculating Predator. This monograph was again based on proprietary research done for my commercial clients. I recycled some of the information, scrubbing that which was deemed inappropriate for anyone to buy for a few British pounds. In that work, I rather methodically explained that Google’s patent documents provided useful information about why the mere Web search engine was investing in what seemed like odd-ball technologies like software janitors. I reworked one diagram to show how the Google infrastructure operated like a prison cell or walled garden. The idea is that once one is in, one may have to work to get past the gatekeeper to get out. I know the image from a book does not translate to a blog post, but, truth be told, I am disinclined to recreate art. At age 78, it is often difficult to figure out why smart drawing tools are doing what they want, not what I want.

Here’s the diagram:

The prison cell or walled garden (2006) from Google Version 2.0: The Calculating Predator, published by Infonortics Ltd., 2006. And for any copyright trolls out there, I created the illustration 20 years ago, not Alamy and not Getty and no reputable publisher.

Three observations about the diagram are: [a] The box, prison cell, or walled garden contains entities, [b] once “in” there is a way out but the exit is via Google intermediated, defined, and controlled methods, and [c] anything in the walled garden perceives that the prison cell is the outside world. The idea obviously is for Google to become the digital world which people will perceive as the Internet.

I thought about my decades old research when I read “Google Tries to Defend Its Web Environment Integrity as Critics Slam It as Dangerous.” The write up explains that Google wants to make online activity better. In the comments to the article, several people point out that Google is using jargon and fuzzy misleading language to hide its actual intentions with the WEI.

The critics and the write up miss the point entirely: Look at the diagram. WEI, like the AMP initiative, is another method added to existing methods for Google to extend its hegemony over online activity. The patent, implement, and explain approach drags out over years. Attention spans, even for academics who make up data like the president of Stanford University, are not interested in anything other than personal goal achievement. Finding out something visible for years is difficult. When some interesting factoid is discovered, few accept it. Google has a great brand, and it cares about user experience and the other fog the firm generates.

MidJourney created this nice image of a Googler preparing for a presentation to the senior management of Google in 2001. In that presentation, the wizard was outlining Google’s fundamental strategy: Fake left, go right. The slogan for the company, based on my research, is “keep them fooled.” Looking the wrong way is the basic rule of being a successful Googler, strategist, or magician.

Will Google WEI win? It does not matter because Google will just whip up another acronym, toss some verbal froth around, and move forward. What is interesting to me is Google’s success. Points I have noted over the years are:

Kindergarten colors, Google mouse pads, and talking like General Electric once did about “bringing good things” continue to work
Google’s dominance is not just accepted; changing or blocking anything Google wants to do is sacrilegious. It has become a sacred digital cow
The inability of regulators to see Google as it is remains a constant, like Google’s advertising revenue
Certain government agencies could not perform their work if Google were impeded in any significant way. No, I will not elaborate on this observation in a public blog post. Don’t even ask. I may make a comment in my keynote at the Massachusetts / New York Association of Crime Analysts’ conference in early October 2023. If you can’t get in, you are out of luck getting information on Point Four.

Net net: Fire up your Chrome browser. Look for reality in the Google search results. Turn cartwheels to comply with Google’s requirements. Pay money for traffic via Google advertising. Learn how to create good blog posts from Google search engine optimization experts. Use Google Maps. Put your email in Gmail. Do the Google thing. Then ask yourself, “How do I know if the information provided by Google is ‘real’?” Just don’t get curious about synthetic data for Google smart software. Predictions about Big Brother are wrong. Google, not the government, is the digital parent whom you embraced after a good “Backrub.” Why change from high school science thought processes? If it ain’t broke, don’t fix it.

Stephen E Arnold, July 31, 2023

The Google Reorg. Will It Output Xooglers, Not Innovations?

Stephen E. Arnold — Tue, 25 Apr 2023 09:30:00 +0000

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

My team and I have been talking about the Alphabet decision to merge DeepMind with Google Brain. Viewed from one angle, the decision reflects the type of efficiency favored by managers who value the idea of streamlining. The arguments for consolidation are logical; for example, the old tried-and-true buzzword synergy may be invoked to explain the realignment. The decision makes business sense, particularly for an engineer or a number-oriented MBA, accountant, or lawyer.

Arguing against the “one plus one equals three” viewpoint may be those who have experienced the friction generated when staff, business procedures, and projects get close, interact, and release energy. I use the term “energy” to explain the dormant forces unleashed as the reorganization evolves. When I worked at a nuclear consulting firm early in my career, I recall the acrimonious and irreconcilable differences between a smaller unit in Florida and a major division in Maryland. The fix was to reassign personnel and give up on the dream of one big, happy group.

This somewhat pathos-infused image was created using NightCafe Creator and Craiyon. The author (a dinobaby) added the caption which may appeal to large language model-centric start ups with money, ideas, and a “we can do this” vibe.

Over the years, my team and I have observed Google’s struggles to innovate. The successes have been notable. Before the Alphabet entity was constructed, the “old” Google purchased Keyhole, Inc. (a spin-off of the gaming company Intrinsic). That worked after the US government invested in the company. There have been some failures too. My team followed the Orkut product which evolved from a hire named Orkut Büyükkökten, who had developed an allegedly similar system while working at InCircle. Orkut was a success, particularly among users in Brazil and a handful of other countries. However, some Orkut users relied on the system for activities which some found unacceptable. Google killed the social networking system in 2014 as Facebook surged to global prominence as Google’s efforts fell to earth. The company was in a position to be a player in social media, and it botched the opportunity. Steve Ballmer allegedly described Google as a “one-trick pony.” Mr. Ballmer’s touch point was Google’s dependence on online advertising: One source of revenue; therefore, a circus pony able to do one thing. Mr. Ballmer’s quip illustrates the fact that over the firm’s 20-plus year history, Google has not been able to diversify its revenue. More than two-thirds of the company’s money comes directly or indirectly from advertising.

My team and I have watched Google struggle to adapt its free-wheeling style to a more traditional business approach to policies and procedures. In one notable incident, my team and I were involved in reviewing proposals to index the content of the US Federal government. Google was one of the bidders. The Google proposal did not follow the expected format of responding to each individual requirement in the request for proposal. In 2000, Google professionals made it clear its method did not require that the government’s statement of work be followed. Other vendors responded, provided the required technical commentary, and produced cost estimates in a format familiar to those involved in the contracting award process. Flash forward 23 years, and Google has figured out how to capture US government work.

The key point: The learning process took a long time.

Why is this example relevant to the Alphabet decision to blend the Brain and DeepMind units? Change, despite the myths of Silicon Valley, is difficult for Alphabet. The tensions at the company are well known. Employees and part-time workers grouse and sometimes carry signs and disturb traffic. Specific personnel matters become, rightly or wrongly, messages that say, Google is unfair. The Google management generated an international spectacle with its all-thumbs approach to human relations. Dr. Timnit Gebru was a co-author of a technical paper which identified a characteristic of smart software. She and several colleagues explained that bias in training data produces results which are skewed. Anyone who has used any of the search systems which used open source libraries created by Google knows that outputs are variable, which is a charitable way of saying, “Dr. Gebru was correct.” She became a Xoogler, set up a new organization, and organized a conference to further explain her research, the same research which ruffled the feathers of some Alphabet big birds.

The pace of generative artificial intelligence is accelerating. Disruption can be smelled like ozone in an old-fashioned electric power generation station. My team and I attempt to continue tracking innovations in smart software. We cannot do it. I am prepared to suggest that the job is quite challenging because the flow of new ChatGPT-type products, services, applications, and features is astounding. I recall the early days of the Internet when in 1993 I could navigate to a list of new sites via Mosaic browser and click on the ones of interest. I recall that in a matter of months the list grew too long to scan and was eventually discontinued. Smart software is behaving in this way: Too many people are doing too many new things.

I want to close this short personal essay with several points.

First, mashing up different cultures and a history of differences will act like a brake and add friction to innovative work. Such reorganizations will generate “heat” in the form of disputes, overt or quiet quitting, and an increase in productivity killers like planning meetings, internal product pitches, and getting legal’s blessing on a proposed service.

Second, a revenue monoculture is in danger when one pest runs rampant. Alphabet does not have a mechanism to slow down what is happening in the generative AI space. In online advertising, Google has knobs and levers. In the world of creating applications and hooking them together to complete tasks, Alphabet management seems to lack a magic button. The pests just eat the monoculture’s crop.

Third, the unexpected consequence of merging Brain and DeepMind may be creating what I call a “Xoogler Manufacturing Machine.” Annoyed or “grass is greener” Google AI experts may go to one of the many promising generative AI startups. Note: A former Google employee is sometimes labeled a “Xoogler,” which is shorthand for ex-Google employee.

Net net: In a conversation in 2005 with a Google professional whom I cannot name due to the confidentiality agreement I signed with the firm, I asked, “Do you think people and government officials will figure out what Google is really doing?” This person, who was a senior manager, said to the best of my recollection, “Sure and when people do, it’s game.” My personal view is that Alphabet is in a game in which the clock is ticking. And in the process of underperforming, Alphabet’s advertisers and users of free and for-fee services will shift their attention elsewhere, probably to a new or more agile firm able to leverage smart software. Alphabet’s most recent innovation is the creation of a Xoogler manufacturing system. The product? Former Google employees who want to do something instead of playing in the Alphabet sandbox with argumentative wizards and several ill-behaved office pets.

Stephen E Arnold, April 24, 2023

Search and Retrieval: A Sub Sub Assembly

Stephen E. Arnold — Mon, 02 Jan 2023 15:32:42 +0000

What’s happening with search and retrieval? Google’s results irritate some; others are happy with Google’s shaping of information. Web competitors exist; for example, Kagi.com and Neeva.com. Both are subscription services. Others provide search results “for free”; examples include Swisscows.com and Yandex.com. You can find metasearch systems (minimal original spidering, just recycling results from other services like Bing.com); for instance, StartPage.com (formerly Ixquick.com) and DuckDuckGo.com. Then there are open source search options. The flagship or flagships are Solr and Lucene. Proprietary systems exist too. These include the ageing X1.com and the even age-ier Coveo system. Remnants of long-gone systems are kicking around too; to wit, BRS and Fulcrum from OpenText, Fast Search now a Microsoft property, and Endeca, owned by Oracle. But let’s look at search as it appears to a younger person today.

A decayed foundation created via smart software on the Mage.space system. A flawed search and retrieval system can make the structure built on the foundation crumble like Southwest Airlines’ reservation system.

First, the primary means of access is via a mobile device. Surprisingly, the source of information for many is video content delivered by the China-linked TikTok or the advertising remora YouTube.com. In some parts of the world, the go-to information system is Telegram, developed by Russian brothers. This is a centralized service, not a New Wave Web 3 confection. One can use the service and obtain information via a query or a group. If one is “special,” an invitation to a private group allows access to individuals providing information about open source intelligence methods or the Russian special operation, including allegedly accurate video snips of real-life war or disinformation.

The challenge is that search is everywhere. Yet in the real world, finding certain types of information is extremely difficult. Obtaining that information may be impossible without informed contacts, programming expertise, or money to pay what would have been called “special librarian research professionals” in the 1980s. (Today, it seems, everyone is a search expert.)

Here’s an example of the type of information which is difficult if not impossible to obtain:

The ownership of a domain
The ownership of a Tor-accessible domain
The date at which a content object was created, the date the content object was indexed, and the date or dates referenced in the content object
Certain government documents; for example, unsealed court documents, US government contracts for third-party enforcement services, authorship information for a specific Congressional bill draft, etc.
A copy of a presentation made by a corporate executive at a public conference.

I can provide other examples, but I wanted to highlight the flaws in today’s findability.

What do these examples say about the efficacy of search?

Years ago, for Searcher Magazine, I wrote an article called “Search Sucks.” I think the editor Barbara Quint changed it to a more politically correct and less accurate title like “Search Does Not Work.” The main point of the piece was to identify the types of unsolved retrieval issues confronting professionally-trained online and traditional researchers. The same problems exist today.

Now many pundits and AI advocates are pitching smart software as the optimal way forward in findability. One of the more interesting mutations of search is described in “AI Allows Dead Woman to Talk to People Who Showed Up at Her Funeral.” Natural language processing, linguistic patterns, and a corpus of text enable smart software to talk with a deceased person.

Another interesting evolutionary mutation strikes at the heart of search vendors who endlessly pitch their search and retrieval system as a way to deliver enhanced customer support. “Customer support” means lower cost interactions with customers. The most recent example of search disguised in a different software shell is “Companies Can Hire a Virtual Person for about $14k a Year in China.” The main idea is that customer service can be delivered via a natural language avatar.

Both of these examples make clear that search and retrieval is now a sub sub system. Keyword matching and semantic analysis make it possible to understand input, craft an answer, and deliver it in a way that requires minimal effort on the part of the person wanting information.

But has search reached a stage of refinement at which it can make sense of what the person interacting with a findability system wants and deliver high-value answers? I would suggest that today’s search has improved since the days of NASA RECON, SDC Orbit, STAIRS III, and other old-school systems. However, today’s smart software is often as effective as the original Smart system developed by Gerard Salton and his colleagues at Cornell University in the 1960s.

For me, search and retrieval, whether delivered with the Dialog Information Services command line or the weirdness of Amazon’s Alexa, leaves me with a sense of opportunities lost. Search and retrieval is more than mindless matching or statistical probabilities based on masses of Web content. Developers and vendors continue to dodge such fundamental issues as editorial policy, computational cost, investment in development of systems that solve high-value problems, methods to reduce bias in a result set, and communication about what caused a certain result to appear in response to a user or system input.

I know first-hand how foreign some of these dinobaby points are to today’s search and retrieval experts. Relevance has become an afterthought, particularly when advertising dollars are a lubricant. Precision is difficult due to synonym expansion, hidden stop words, and arbitrary decisions about what to expose to a user or a system.
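The claim that synonym expansion makes precision difficult can be shown with arithmetic on a toy collection. This is a sketch under loud assumptions: the five documents, the relevance judgments, and the synonym table are all invented for the example, and no real search system is being described. Precision here is the textbook measure, relevant items retrieved divided by all items retrieved.

```python
# Invented collection: the searcher wants the animal, not the automobile.
DOCS = {
    1: "jaguar habitat in the amazon",
    2: "jaguar big cat diet",
    3: "used car dealer listings",
    4: "automobile repair manual",
    5: "big cat conservation",
}
RELEVANT = {1, 2}                              # hypothetical relevance judgments
SYNONYMS = {"jaguar": {"car", "automobile"}}   # hypothetical expansion table


def retrieve(query_terms, expand=False):
    """Return ids of docs sharing at least one term with the (expanded) query."""
    terms = set(query_terms)
    if expand:
        for term in query_terms:
            terms |= SYNONYMS.get(term, set())
    return {doc_id for doc_id, text in DOCS.items() if terms & set(text.split())}


def precision(retrieved, relevant):
    """Fraction of retrieved docs that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


plain = retrieve(["jaguar"])
expanded = retrieve(["jaguar"], expand=True)
print(sorted(plain), precision(plain, RELEVANT))
print(sorted(expanded), precision(expanded, RELEVANT))
```

With the plain query, both hits are on point. With expansion switched on behind the user’s back, half of the result set is about cars, and the user has no way to know why.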

I am concerned about several trends which I think may become more evident in 2023. Feel free to disagree. I am at an age and station in life that criticism of my ideas is familiar.

First, the emergence of what I call the DYOR expert. This is an individual, a group, a company, or an association which positions itself as an expert researcher. Some of these open source experts do not have training in special librarian tools, mindset, or techniques. Commercial services are ignored because they cost money. Why not use Twitter instead? What I think is emerging is a class of online researchers who will manifest some information blind spots as a result of haste, budget constraints, and knowledge gaps. I am not sure how to address this issue because OSINT is the trendy way to get information. If I raise a doubt, I hear, “Look at the Ukraine Russia war. Without OSINT we would know nothing.” I think it depends on the institution with which the OSINT advocate works. DYOR methods can mislead or just be wrong.

Second, the diminution of search and retrieval. I think many of today’s findability systems have taken their eye off the ball. The difficult parts of search are ignored or assumed to be solved. Why not download something from an open source software repository? That will be good enough. The results will be close enough for horseshoes. That attitude is, in my view, dangerous. As automation becomes increasingly pervasive, it will be impossible to identify that an issue may be buried deep in a sub sub system. The fix is to ignore the problem or create a software shell and move forward. This mindset is likely to have some interesting and dire unforeseen consequences.

Third, the idea that anyone and everyone is an expert in search and retrieval is a potentially dangerous illusion. Knowing how to assess a content object and having the know how to verify a datum or an item of information is a key part of a knowledge toolkit. If people don’t know about such a toolkit and have zero desire to master the specific tools, decisions are likely to go off the rails. One current example is the failure of Southwest Airlines to be an airline. Search flaws almost guarantee that findability will suffer the same fate as those abandoned bags and idle airplanes.

Net net: Search is an issue. I think that in 2023 more people will realize that it is far more important than cheaper customer support or a weird voice from beyond the grave.

Stephen E Arnold, January 2, 2023

A Xoogler May Question the Google about Responsible and Ethical Smart Software

Stephen E. Arnold — Thu, 02 Dec 2021 14:31:03 +0000

Write a research paper. Get colleagues to provide input. Well, ask colleagues to do that work and what do you get? How about “Looks good.” Or “Add more zing to that chart.” Or “I’m snowed under so it will be a while but I will review it…” Then the paper wends its way to publication and a senior manager type reads the paper on a flight from one whiz kid town to another whiz kid town and says, “This is bad. Really bad because the paper points out that we fiddle with the outputs. And what we set up is biased to generate the most money possible from clueless humans under our span of control.” Finally, the paper is blocked from publication and the offending PhD is fired or sent signals that your future lies elsewhere.

Will this be a classic arm wrestling match? The winner may control quite a bit of conceptual territory along with knobs and dials to shape information.

Could this happen? Oh, yeah.

“Ex Googler Timnit Gebru Starts Her Own AI Research Center” documents the next step, which may mean that some wizards’ undergarments will be sprayed with eau de poison oak for months, maybe years. Here’s one of the statements from the Wired article:

“Instead of fighting from the inside, I want to show a model for an independent institution with a different set of incentive structures,” says Gebru, who is founder and executive director of Distributed Artificial Intelligence Research (DAIR). The first part of the name is a reference to her aim to be more inclusive than most AI labs—which skew white, Western, and male—and to recruit people from parts of the world rarely represented in the tech industry. Gebru was ejected from Google after clashing with bosses over a research paper urging caution with new text-processing technology enthusiastically adopted by Google and other tech companies.

The main idea, which Wired and Dr. Gebru delicately sidestep, is that there are allegations of an artificial intelligence or machine learning cabal drifting around some conference hall chatter. On one side is the push for what I call the SAIL approach. The example I use to illustrate how this cost effective, speedy, and clever short cut approach works comes from some of the work of Dr. Christopher Ré, the captain of the objective craft SAIL. Oh, is the acronym unfamiliar to you? SAIL is the short version of Stanford Artificial Intelligence Laboratory. SAIL fits on the Snorkel content diving gear I think.

On the other side of the ocean are Dr. Timnit Gebru’s fellow travelers. The difference is that Dr. Gebru believes that smart software should not reflect the wit, wisdom, biases, and general bro-ness of the high school science club culture. This culture, in my opinion, has contributed to the fraying of the social fabric in the US, caused harm, and eroded behaviors that are supposed to be subordinated to “just what people do to make a social system function smoothly.”

Does the Wired write up identify the alleged cabal? Nope.

Does the write up explain that the Ré / Snorkel methods sacrifice some precision in the rush to generate good enough outputs? (Good enough can be framed in terms of ad revenue, reduced costs, and faster time to market testing in my opinion.) Nope.

Does Dr. Gebru explain how insidious the short cut training of models is and how it will create systems which actively harm those outside the 60 percent threshold of certain statistical yardsticks? Heck, no.

Hopefully some bright researchers will explain what’s happening with a “deep dive”? Oh, right, Deep Dive is the name of a content access company which uses Dr. Ré’s methods. Ho, ho, ho. You didn’t know?

Beyond Search believes that Dr. Gebru has important contributions to make to applied smart software. Just hurry up already.

Stephen E Arnold, December 2, 2021

Search Engines: Bias, Filters, and Selective Indexing

Stephen E. Arnold — Mon, 15 Mar 2021 05:15:34 +0000

I read “It’s Not Just a Social Media Problem: How Search Engines Spread Misinformation.” The write up begins with a Venn diagram. My hunch is that quite a few people interested in search engines will struggle with the visual. Then there is the concept that typing in a search term returns results that are like loaded dice in a Manhattan craps game in Union Square.

The reasons, according to the write up, that search engines fall off the rails are:

Relevance feedback or the Google-borrowed CLEVER method from IBM Almaden’s patent
Fake stories which are picked up, indexed, and displayed as value infused.

The write up points out that people cannot differentiate between accurate, useful, or “factual” results and crazy information.

Okay, here’s my partial list of why Web search engines return flawed results:

Stop words. Control the stop words and you control the info people can find.
Stored queries. Type what you want but get the results already bundled and ready to display.
Selective spidering. The idea is that any index is a partial representation of the possible content. Instruct spiders to skip Web sites with information about peanut butter, and, bingo, no peanut butter information.
Spidering depth. Is the bad stuff deep in a Web site? Just limit the crawl to fewer links.
Spider within a span. Is a marginal Web site linking to sites with info you want killed? Don’t follow links off a domain.
Delete the past. Who looks at historical info? A better question, “What advertiser will pay to appear on old content?” Kill the backfile. Web indexes are not archives no matter what thumbtypers believe.
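
A few of these levers can be sketched in a handful of lines. What follows is a toy illustration, not any real search engine's code: the CrawlPolicy class, its field names, and the example URLs are all hypothetical. It shows how a depth limit, a stay-on-domain rule, a blocked phrase list, and a stop word list each quietly decide what reaches the index.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlPolicy:
    """Hypothetical knobs; not any real search engine's configuration."""
    max_depth: int = 2                 # spidering depth
    stay_on_domain: bool = True        # spider within a span
    blocked_phrases: set = field(default_factory=set)  # selective spidering
    stop_words: set = field(default_factory=set)       # stop word control


def should_fetch(policy, seed_url, url, depth):
    """Decide whether the spider follows a link found at a given depth."""
    if depth > policy.max_depth:
        return False
    if policy.stay_on_domain and urlparse(url).netloc != urlparse(seed_url).netloc:
        return False
    return True


def index_terms(policy, text):
    """Return the terms that reach the index for one page of text."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in policy.blocked_phrases):
        return []  # the whole page is skipped, so nothing is findable
    words = [w.strip(".,;:!?") for w in lowered.split()]
    return [w for w in words if w and w not in policy.stop_words]


policy = CrawlPolicy(max_depth=1,
                     blocked_phrases={"peanut butter"},
                     stop_words={"the", "is"})
print(should_fetch(policy, "https://example.com/", "https://other.org/", 1))
print(index_terms(policy, "Peanut butter recipes"))
print(index_terms(policy, "The index is partial."))
```

The user never sees any of these settings. The query runs, results appear, and the missing pages simply do not exist as far as the searcher can tell.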

There are other methods available as well; for example, objectionable info can be placed in near line storage so that results from questionable sources display with latency or slow enough to cause the curious user to click away.

To sum up, some discussions of Web search are not complete or accurate.

Stephen E Arnold, March 15, 2021

Open Source Software: The Community Model in 2021

Stephen E. Arnold — Mon, 25 Jan 2021 15:11:24 +0000

I read “Why I Wouldn’t Invest in Open-Source Companies, Even Though I Ran One.” I became interested in open source search when I was assembling the first of three editions of Enterprise Search Report in the early 2000s. I debated whether to include Compass Search, the precursor to Shay Banon’s Elasticsearch reprise. Over the years, I have kept my eye on open source search and retrieval. I prepared a report for the outfit IDC, which happily published sections of the document and offered my write ups for $3,000 on Amazon. Too bad IDC had no agreement with me, managers who made Daffy Duck look like a model for MBAs, and a keen desire to find a buyer. Ah, the book still resides on one of my back up drives, and it contains a run down of where open source was getting traction. I wrote the report in 2011 before getting the shaft-a-rama from a mid tier consulting firm. Great experience!

The report included a few nuggets which in 2011 not many experts in enterprise search recognized; for instance:

Large companies were early and enthusiastic adopters of open source search; for example, Lucene. Why? Reduce costs and get out of the crazy environment which put Fast Search & Transfer-type executives in prison for violating some rules and regulations. The phrase I heard in some of my interviews was, “We want to get out of the proprietary software handcuffs.” Plus big outfits had plenty of information technology resources to throw at balky open source software.
Developers saw open source in general and contributing to open source information retrieval projects as a really super duper way to get hired. For example, IBM — an early enthusiast for a search system which mostly worked — used the committers as feedstock. The practice became popular among other outfits as well.
Venture outfits stuffed with oh-so-technical MBAs realized that consulting services could be wrapped around free software. Sure, there were legal niceties in the open source licenses, but these were not a big deal when Silicon Valley super lawyers were just a text message away.

There were other findings as well, including the initiatives underway to embed open source search, content processing, and related functions into commercial products. Attivio (formed by former super star managers from Fast Search & Transfer), Lucid Works, IBM, and other bright lights adopted open source software to [a] reduce costs, [b] eliminate the R&D required to implement certain new features, and [c] develop expensive, proprietary components, training, and services.

In case you did not know, value-added services and proprietary components generate the big bucks in search, not the license fees. Palantir Technologies uses open source software and almost mandatory on site engineers and consultants. Why? That’s the bestest way ever to create 21st century lock in. The approach appears to be working. “Appears” is an operative word.

Now back to the essay, which contains a list of some open source software business models; to wit:

Open-Core: The company open-sources a slimmed-down version of its product and sells a fully-featured “enterprise version” on top (e.g., Kafka/Confluent or Docker/Docker EE)
Hosted Version: The company offers a fully managed hosted version of its product (e.g., MongoDB or Grafana)
Support & Consultancy: The company offers support and consultancy services around its open-source product (e.g., Elastic or MongoDB)

The best part of the write up is, in my opinion, this statement:

Each of these models come with inherent conflicts of interest.

I want to point out that the outlook for certain popular open source software — for instance, Lucene/Solr — is likely to follow the path Amazon has taken with Elasticsearch.

What do I make of this open source thing? It was great while it lasted. But software is moving on. Remember, please, that the cloud giants want customers to forget about software and think about the bigger picture: No code, snap together solutions, and subscriptions for the right to pay for engineering and consulting services.

Stephen E Arnold, January 25, 2021

Security Vendors: Despite Marketing Claims for Smart Software Knee Jerk Response Is the Name of the Game

Stephen E. Arnold — Wed, 16 Dec 2020 14:44:20 +0000

Update 3, December 16, 2020 at 1005 am US Eastern, the White House has activated its cyber emergency response protocol. Source: “White House Quietly Activates Cyber Emergency Response” at Cyberscoop.com. The directive is located at this link and verified at 1009 am US Eastern as online.

Update 2, December 16, 2020 at 1002 am US Eastern. The Department of Treasury has been identified as an entity compromised by the SolarWinds’ misstep. Source: “US Treasury, Commerce Depts. Hacked through SolarWinds Compromise” at KrebsonSecurity.com

Update 1, December 16, 2020, at 950 am US Eastern. The SolarWinds’ security misstep may have taken place in 2018. Source: “SolarWinds Leaked FTP Credentials through a Public GitHub Repo “mib-importer” Since 2018” at SaveBreach.com

I talked about security theater in a short interview/conversation with a former CIA professional. The original video of that conversation is here. My use of the term security theater is intended to convey the showmanship that vendors of cyber security software have embraced for the last five years, maybe more. The claims of Dark Web threat intelligence, the efficacy of investigative software with automated data feeds, and Bayesian methods which inoculate a client from bad actors: maybe this is just Madison Avenue gone mad. On the other hand, maybe these products and services don’t work particularly well. Maybe these products and services are anchored in what bad actors did yesterday and are blind to the here and now of dudes and dudettes with clever names?

Evidence of this approach to a spectacular security failure is documented in the estimable Wall Street Journal (hello, Mr. Murdoch) and the former Ziff entity ZDNet. Numerous online publications have reported, commented, and opined about the issue. One outfit with a bit of first hand experience with security challenges (yes, I am thinking about Microsoft) reported “SolarWinds Says Hack Affected 18,000 Customers, Including Two Major Government Agencies.”

One point seems to be sidestepped in the coverage of this “concern.” The corrective measures kicked in after the bad actors had compromised and accessed what may be sensitive data. Just a mere 18,000 customers were affected. Who were these “customers”? The list seems to have been disappeared from the SolarWinds’ Web site and from the Google cache. But Newsweek, an online information service, posted this which may, of course, be horse feathers (sort of like security vendors’ security systems?):

Notice that the US Secret Service is on this list. How many other US government enforcement agencies were SolarWinds’ customers? That’s an interesting question.

Net net:

Bad actors compromised a security vendor but only 18,000 customers were affected. Yes, that’s good news I suppose.
Numerous companies jumped on video conference calls and figured out how to deal with the bad actors’ activities. These activities began exactly when? Yes, that’s another interesting question.
Brightcove is pitching its security videos in the midst of this “only 18,000 customers” thing; for example, The Science of Cybersecurity: Digital Transformation in Retail with Courtney Radke, Fortinet National Retail CISO, and Theresa Lanowitz, Director, AT&T Cybersecurity.

Security theater and its regularly scheduled programming are uninterrupted.

A final question: When will software systems be upfront about their true behaviors? Yep, another interesting question. But I know the answer to this one: Exactly never.

Stephen E Arnold, December 16, 2020

Google: Simplifying Excellence

Stephen E. Arnold — Thu, 22 Oct 2020 14:39:35 +0000

Almost everyone knows Google. I spotted an eclectic write up in Entertainment Overdose (an estimable publication). The article “Eric Schmidt, Who Got YouTube for a Premium, Assumes Social Media Networks Are Amplifiers for Idiots” contains a quote. This is an alleged statement attributed to Eric Schmidt, the overseer of Google until 2018.

Here’s the alleged pearl of wisdom:

The context of social networks serving as amplifiers for idiots and crazy people is not what we intended.

But it happened with YouTube, right? Who was running the company at this time? I think it was Mr. Schmidt.

It seems that Mr. Schmidt’s social world view is divided into those who are not crazy (possibly Google employees and those who share some Google mental characteristics but are in some way in touch with reality) and those who are crazy. Crazy means mentally deranged, which may be a bad thing. Plus, the “crazy” group uses social media as “amplifiers.” This seems to suggest that anyone using social media falls into the crazy category. Is this correct?

Note the “we”. The royal “we” appears to embrace the senior management of Google.

Now check out the Rupert Murdoch “real” news Wall Street Journal for October 22, 2020. The story to which I direct your attention is called “Google Ex-CEO Hits DOJ As Antitrust Battle Looms.” [When the story is posted to wsj.com, you will have an opportunity to purchase access. Until then, hunt for the dead tree edition and look on Page A-1.]

The write up reports that Mr. Schmidt said:

There’s a difference between dominance and excellence.

The idea may be that operating like a plain vanilla monopoly is not acceptable. This suggests that a monopoly delivering “excellence” is a positive for everyone.

Is YouTube dominant or excellent? Are those who post links to children’s playgrounds to the delight of individuals with proscribed tendencies idiots? (There are other, more suitable terms I believe.)

And that may bring up other questions; for example:

What about YouTube? That’s a social media type service which generates billions in ad revenue. What percentage of the YouTube content is for Googley types? What percentage for the crazies? Why is crazy content allowed on YouTube? Why does YouTube allow videos which show users how to steal commercial software? Run this query: photoshop cs6 crack. Here’s what I saw displayed on October 22, 2020, at 1023 am US Eastern:

Net net: Google operates in a medieval “great chain of being” mode. Google is at the apex. Others are lower on the chain. True, Google is interesting and successful. Those who explain the “excellence” of Google are minimizing the social impact of its actions. But if you understand the difference between dominance and excellence, all is right with the world. If you don’t get it, Google has some descriptors: idiot and crazy. There is a difference for sure.

Excellence may not be the correct word.

Stephen E Arnold, October 22, 2020

Exclusive: Interview with DataWalk’s Chief Analytics Officer Chris Westphal, Who Guides an Analytics Rocket Ship

Stephen E. Arnold — Wed, 21 Oct 2020 09:30:55 +0000

I spoke with Chris Westphal, Chief Analytics Officer for DataWalk about the company’s string of recent contract “wins.” These range from commercial engagements to heavy lifting for the US Department of Justice.

Chris Westphal, founder of Visual Analytics (acquired by Raytheon) brings his one-click approach to advanced analytics.

The firm provides what I have described as an intelware solution. DataWalk ingests data and outputs actionable reports. The company has leap-frogged a number of investigative solutions, including IBM’s Analyst’s Notebook and the much-hyped Palantir Technologies’ Gotham products. This interview took place in a Covid compliant way. In my previous Chris Westphal interviews, we met at intelligence or law enforcement conferences. Now the experience is virtual, but as interesting and informative as in July 2019. In my most recent interview with Mr. Westphal, I sought to get more information on what’s causing DataWalk to make some competitors take notice of the company and its use of smart software to deliver what customers want: Results, not PowerPoint presentations and promises. We spoke on October 8, 2020.

DataWalk is an advanced analytics tool with several important innovations. On one hand, the company’s information processing system performs IBM i2 Analyst’s Notebook and Palantir Gotham type functions — just with a more sophisticated and intuitive interface. On the other hand, Westphal’s vision for advanced analytics has moved past what he accomplished with his previous venture Visual Analytics. Raytheon bought that company in 2013. Mr. Westphal has turned his attention to DataWalk. The full text of our conversation appears below.

The Westphal Interview: Autumn 2020

Thanks for taking the time to speak with me today. Why DataWalk?

Raytheon acquired my previous company, Visual Analytics, in 2013 and I initially stayed on to help with the transition. Raytheon is an exceptional company. They are extremely innovative, they help defend and protect our national security, they have incredible talent, and they’re really good at integrating and configuring technologies. However, Raytheon is not a commercial software development shop and I wanted to get back-to-basics with a smaller and more agile organization to serve a broader “investigative” community from regional police organizations to large federal agencies. I saw the DataWalk platform as the catalyst to enable this transition and with my background, I could help guide and affect its progression into the marketplace.

Would you describe the principal technical and feature differences between Visual Analytics and DataWalk?

When I co-founded Visual Analytics back in 1998, the Apache Software Foundation did not exist. Thus, we had to innovate, design, and create some very advanced software, techniques, and systems from scratch to deliver our analytical platform (Data Clarity) into this fledgling marketplace. It was challenging, invigorating, and rewarding to see our system evolve and address some very complex environments across a wide range of agencies including the Defense Intelligence Agency, DISA, FinCEN, IRS, FBI, CIA, US Army, US Marshals Service, and the New York Police Department to name a few, plus deployments to over 40 countries while working extensively through our partnership channels. The system was very capable and feature-rich but it required some upfront training to configure and use properly.

I’ve always been very “user-focused” and have significant client-empathy to deliver better analytics through powerful visualizations, intuitive interactions, and focused outcomes. One goal I always wanted to achieve was to define a process to naturally and non-intrusively capture the tacit knowledge a user employs while analyzing data. I also wanted a way to share this encoded knowledge with other users in a secure, reliable, and manageable fashion using methods that are transparent and easy to understand, evaluate, and audit.

DataWalk eloquently addresses this need as each click, filter, query, or selection in the graphical user interface, called the Universe Viewer, generates a breadcrumb defining the selected action. You can see examples on the DataWalk Web site. As users progress in their analyses, these breadcrumbs chain together and form workflows. With DataWalk it’s easy to go back and change a parameter, pursue a different analytical branch, or simply save the results.

Once workflows are saved, they are accessible in a user-dashboard to re-run or use as an alert to monitor for specific changes in the underlying data sets. They also form the basis for risk-scoring where each workflow is individually weighted. There can be dozens of scores created with different combinations of workflows and any single entity (for example, a person, address, transaction, account, etc.) can be part of many risk scores – which are easily aggregated into a master-risk-score, if required.
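
The weighting idea can be made concrete with a small sketch. This is an illustration only, not DataWalk's implementation: the function names, the sample entity, the weights, and the choice to aggregate by simple addition are all assumptions. Each saved workflow is modeled as a yes/no check with a weight, a risk score is the sum of the weights for the checks an entity trips, and the master score adds the individual scores together.

```python
def risk_score(entity, weighted_workflows):
    """Sum the weights of the workflows whose condition the entity meets."""
    return sum(weight for check, weight in weighted_workflows if check(entity))


def master_risk_score(scores):
    """Aggregate individual risk scores; plain addition is an assumption."""
    return sum(scores.values())


# Hypothetical entity and hypothetical workflows with made-up weights.
entity = {"cash_deposits": 12, "countries": 4, "shared_address": True}

fraud = [
    (lambda e: e["cash_deposits"] > 10, 3.0),
    (lambda e: e["shared_address"], 1.5),
]
laundering = [
    (lambda e: e["countries"] > 5, 4.0),
    (lambda e: e["cash_deposits"] > 10, 2.0),
]

scores = {
    "fraud": risk_score(entity, fraud),
    "laundering": risk_score(entity, laundering),
}
print(scores, master_risk_score(scores))
```

The same entity appears in both scores, which matches the point that any single entity can be part of many risk scores. A production system would presumably normalize or cap the totals; the sketch does not.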

Imagine creating a library of workflows to cover the conditions for analyzing suspicious banking transactions, detecting fraudulent activities, flagging improper payments, evaluating communication records for collusion, examining log files for cyber-crimes, exposing human trafficking patterns, or monitoring sources to detect terrorist behaviors. DataWalk has achieved these capabilities using a consistent and repeatable framework to capture and democratize the domain knowledge.

Palantir Technologies went public and as part of that process, the company revealed that it had fewer than 140 customers and was losing hundreds of millions of dollars. What differentiates DataWalk from Palantir?

My elevator speech often starts with, “We are a cost-affordable and integrator friendly alternative to Palantir…” Fundamentally, our business models are very different. Palantir tends to be a one-stop shop (all-or-nothing) delivering professional services wrapped around a core technology, whereas DataWalk directly sells software licenses designed as an easy-to-configure, user-centric platform that quickly couples with different sources and external (federated) systems.

DataWalk’s system and method for accessing processed content delivers cross-corpus results. A DataWalk user gets a view across text, numbers, and other content without having to perform separate data manipulations.

In comparing the technical features and functions, as posted by Palantir regarding their Titan release, a majority of their “innovative” features appear to reflect capabilities that already exist within DataWalk. Both are highly scalable, both support collaboration, and both support accessing external interfaces – to name a few similarities. Of course, each tool has its own strengths and weaknesses depending on what requirements are evaluated. However, there are some “real” differences between our platforms.

One major difference is that “all” data must be converted into an internal Palantir entity format (via pXML) which requires additional time, effort, and costs to reformat the data into a proprietary Palantir ontology. Extracting this content along with any derived analytics is not straightforward and I believe was one of the reasons why the New York Police Department terminated the use of Palantir, which was written up by Buzzfeed. With DataWalk, everything is done using open standards. It’s easy to get data in and easy to get data out with no transformations required. And, the client owns all their data, all their analytics, and anything else produced using the platform. Period.

Another major difference is the price and configuration. As shown in GSA Schedule-70, the cost of a single Palantir core is $141,015 versus $35,000 for DataWalk; we’re over 75 percent less expensive. See (starting on page 34) their GSA schedule to confirm the costs. Furthermore, DataWalk, also on GSA (NASA SEWP, CIO-CS, DOJ-BPA), offers clients several licensing models (core, concurrent, perpetual, and term) to best fit their budget and usage requirement. A basic system for five users costs less than $100,000 to purchase or less than $5,000 per month to lease.
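
The “over 75 percent” figure follows directly from the two per-core prices quoted above, as this quick check shows. The break-even line is a rough illustration only: it treats the “less than $100,000” and “less than $5,000 per month” ceilings as if they were exact prices, which the interview does not say.

```python
# Per-core prices as quoted from GSA Schedule-70 in the interview.
palantir_core = 141_015
datawalk_core = 35_000

# Fraction saved per core by choosing the cheaper license.
savings = 1 - datawalk_core / palantir_core
print(f"Savings per core: {savings:.1%}")

# Assumption: use the quoted ceilings as stand-ins for actual prices.
purchase_ceiling = 100_000
lease_ceiling_per_month = 5_000
breakeven_months = purchase_ceiling / lease_ceiling_per_month
print(f"Lease vs. purchase break-even: about {breakeven_months:.0f} months")
```

The savings work out to roughly 75.2 percent per core. On the ceiling figures, leasing costs less than buying for about the first 20 months.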

In past engagements, Palantir required “significant” consulting time for their forward deployed engineers (ninjas) to configure the system to meet customer needs. Many issues including transparency, inflated costs, and complex usage are discussed in a Wired article about Palantir; for instance and I quote:

Palantir uses an opaque pricing model and does not discretely identify to customers the costs of software, hardware, equipment, and professional services… Palantir Technologies, Inc. is the only vendor available to provide support and maintenance on Palantir’s Gotham software platform… [which] is proprietary to Palantir Technologies, Inc.

Finally, many well-respected system integrators operating in the government sector do not work with Palantir, as Palantir generally does not “play-well-with-others” as stated in the Wired article – thereby limiting your choice of what vendor can provide the onsite services and support. DataWalk is a true commercial off the shelf platform with roadmaps, APIs, manuals, bug-fixes, release notes, training guides, and even a partner program. Plus, we work very well with both clients and integrators to get them proficient on our platform. This avoids “vendor lock-in” and saves significant resources and costs for ongoing operations and maintenance.

The market for intelware or policeware is limited and in many statements of work, the emphasis is upon a compromise of excellence and price. What’s your approach to the policeware market?

Most opportunities or requests for proposals from law enforcement look for an integrated solution to analyze data from their cases, arrests, incidents, leads, license plates, parking tickets, gangs, gun permits, accidents, jail, probation, and different records management systems (RMS). Often, they want to combine multiple technologies to incorporate search, analytics, charting, mapping, prediction, and reporting. And, they want it easy to use, easy to maintain, and easy to train.

The DataWalk interface provides visual cues for an investigative workflow and smart icons. The idea is that the system provides the functions required to deliver the information required to address a specific issue in a case.

Cost is always a top concern for law enforcement operations. As agencies transition to become more data-centric, they embrace the time-to-results and consistency obtained from using an enterprise-wide analytical capability. Affordability is a relative dimension in this marketplace and according to a National Police Foundation article:

…the standard cost to recruit, hire, equip, and fully train a police officer from the time they submit their initial application to the time they can function independently may exceed $100,000 and take up to eighteen months.

For less than the cost of onboarding a single officer, a system like DataWalk delivers an ROI in a much shorter time frame (days/weeks), operates 24/7, and doesn’t charge for overtime.

Here’s my main point: Our goal is not just to sell licenses – it’s to deliver operational platforms to help solve real world problems. There are approximately 18,000 police agencies within the United States and each has its own specific requirements, thus, there are no cookie-cutter deployments; one-size-does-not-fit-all. Certainly, there’s overlap among requirements, but for each deployment, we must evaluate and address their immediate needs and be adaptable to deliver on any number of future requests.

A good example of this occurred using DataWalk in a multi-jurisdictional gang task force to simplify and standardize the available data from across all the participating agencies. This group seizes a lot of mobile devices and uses digital forensics to produce some basic reports. DataWalk improved this process by creating an importer for the Cellebrite platform to ingest the content from multiple devices to cross reference calls, contacts, messages, texts, locations, and other important data.

One of the detectives asked if we could also do anything with all the images and photos contained on these devices since it was then a manual process to review. To address this need, we incorporated an API call-out to a third-party machine learning library called TensorFlow to categorize and define the content (for example, vehicles, weapons, drugs, people, and even nudity). We delivered this capability in a few days and made it available for use by all DataWalk’s clients.

Let me also add that we are looking at other markets. There are approximately 6,000 insurance companies and over 5,000 banks in the US. Although their business processes tend to be more homogeneous, they still want highly configurable systems to deliver better, faster, and more accurate results that can easily adapt to new fraud schemes and money laundering patterns.

“Cyber” is a buzzword. However, the security issues facing many organizations — insider threat, data loss, bogus data, fraud, etc. — are growing problems. Where do you fit in a landscape in which “cyber” solutions may be greeted with skepticism because existing solutions either do not work or are too complicated to work?

There’s a lot of confusion and misunderstanding in the cyber-marketplace. Depending on who you ask, it means different things including detecting irregular network traffic, suspicious user behaviors, account takeovers, external threats, content abuse, or hostile actions – plus a lot more. These are all very different problem sets. However, just like any other domain, it all comes down to the “data” and what you expect to do with it. Most stakeholders are aware there are problems but are not sure how to best address them.

For example, a government agency may routinely receive online applications for benefits, loans, refunds, or entitlements. Generally, most transactions are normal in appearance and don’t set off any warning flags. However, if different applications are submitted from the same IP address, subnet, or TOR exit node – is it suspicious? What if the IP is from a Starbucks location? Suppose they are all within a few minutes of each other? Are the email addresses similar (for example, use of dots in naming Gmail accounts)? What is the local time of the transaction – 2:00am? Do the accounts all use the same pattern to encode the password? What if the same account has multiple logins within a short time period, yet the geocoded location of the IP addresses show they are miles apart?
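
Two of these questions translate readily into code. The sketch below is a generic illustration, not DataWalk functionality: the function names, the record layout, the five minute window, and the threshold of three are all assumptions. One function collapses dotted Gmail variants to a single address (Gmail ignores dots in the local part), and the other flags any IP address that submits several applications inside a short window.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def normalize_gmail(address):
    """Collapse dotted Gmail variants; other domains are only lowercased."""
    local, _, domain = address.lower().partition("@")
    if domain == "gmail.com":
        local = local.replace(".", "")
    return f"{local}@{domain}"


def flag_bursts(applications, window=timedelta(minutes=5), threshold=3):
    """Return IPs with `threshold` or more submissions inside `window`."""
    by_ip = defaultdict(list)
    for app in applications:
        by_ip[app["ip"]].append(app["submitted"])
    flagged = set()
    for ip, times in by_ip.items():
        times.sort()
        # Slide over each run of `threshold` consecutive submissions.
        for i in range(len(times) - threshold + 1):
            if times[i + threshold - 1] - times[i] <= window:
                flagged.add(ip)
                break
    return flagged


# Hypothetical records using documentation-only IP ranges.
t0 = datetime(2020, 10, 8, 2, 0)
apps = [{"ip": "203.0.113.7", "submitted": t0 + timedelta(minutes=m)}
        for m in (0, 1, 3)]
apps.append({"ip": "198.51.100.2", "submitted": t0})

print(flag_bursts(apps))
print(normalize_gmail("J.Doe@gmail.com") == normalize_gmail("jdoe@gmail.com"))
```

A flag here is a prompt for review, not a verdict. As the Starbucks question suggests, a shared public IP address would trip the same rule for entirely innocent applicants.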

A DataWalk user can obtain a report about an entity with a click. The report contains pictures, documents, and videos. A click on any object reveals the underlying content, including relationship graphs or social graphs of an entity’s connections.

The cyber-domain is constantly changing and the adversaries regularly update their tactics and therefore, the solution must adapt to keep pace with them. Also, those systems only looking at a single part of the overall data will have inherent limitations; more data touch points deliver better resolution into the problem space.

What are you doing to reduce the amount of time required for an analyst to learn how to use your system and become productive?

Except for assembling IKEA furniture, most people don’t read the instructions. Thus, we’ve made our system inherently easy to use and invested heavily into designing simple interfaces using visualizations, dashboards, and graphics. We’ve limited the number of options to reduce the complexity and keep the interfaces intuitive. Of course, we provide online training guides and self-paced video-tutorials to introduce various features and show the users how to operate the platform. After a few hours, most users are productive.

In more targeted deployments, we can use preconfigured workflows to automatically generate results. We can set up risk-scores to quickly identify the most “suspicious” entities. We can train machine learning models to classify the data. We can even generate alerts to notify users when specific data conditions are met. There are many ways to help deliver results so the analysts can remain focused-on and effective-in their investigations.

A year ago I mentioned that DataWalk had boiled down your expertise to an icon infused with your expertise and smart software. The demands on a skilled analyst like yourself are significant. How are the smart icons and the DataWalk interface addressing the challenges of selecting the appropriate procedure for a specific situation?

Within many operations, there is a consistent and repeated level of turnover as personnel transition to new roles and make their assigned rotations. Unfortunately, when these assets move-on, they also take valuable knowledge and the insights learned during their tenure. DataWalk delivers a more agile and adaptable protocol where the processes, workflows, and outcomes are captured, stored, and made available for sharing, alerting, and reporting. The smart icons (for example, saved workflows) are created by the users, on their own data, in a unique context to address a specific problem or mission need. The workflows encode the organizational knowledge, where the newest analyst can consistently run the same analyses generated by seasoned users. This builds an expanding knowledgebase of expertise that is auditable, adaptable, repeatable, and remarkably transparent in its operation. Additionally, I remain involved in a lot of opportunities and contribute by helping to define the analytical models, create data transformations, recommend new sources, and incorporate third-party functionality. I’m also working on creating knowledge libraries for specific content. Our goal is to deliver results and we’re always innovating. We want our clients to be successful and I’m part of a great support team. I’m always available. It’s my passion.

Every vendor with which I speak tells me AI, machine learning, predictive analytics, secret sauce. What are you implementing that you consider cutting edge?

One ingredient to our secret sauce is the way we made DataWalk extensible – we made it easy to expand to include other “components” to extend its functionality. Most of the add-ons (micro services) are made available through the App-Center where a published interface accesses 3rd party systems, subscription services, or various external libraries. App-Center add-ons can include Natural Language Processing (NLP) modules, AI/ML libraries, access to statistical systems like R, social media interfaces, open-source exploitation, document management systems, digital forensics, and many others. DataWalk ships with scripts available in the App Center for platforms such as Rosoka, spaCy, Whoster, WebHose.io, ShadowDragon, TensorFlow, WhoIs, Libpostal, and many-many others. It is straightforward to create scripts to extract data from systems using client-generated templates or configurations. New apps are easily created, by partners, integrators, or clients to ensure the system remains extensible to meet a wide range of needs.

The DataWalk system generates a social graph. People, activities, and other information can be absorbed quickly by the investigator.

As for machine learning, the latest version incorporates third-party libraries like H2O / AutoML to support more predictive results and actions. DataWalk facilitates the creation of machine learning models using a powerful framework to inline and manage all the processing (training) and custom algorithms (client defined). DataWalk supervises the processes to ensure they are sharable, trackable, and easy to maintain thus accommodating reusability while supporting fast iteration cycles to develop new models. Models are coded and deployed in DataWalk where the results, potentially from multiple machine learning models, are available to compare, contrast, or combine. Machine learning is embedded using simple wrapper functions to provide a unified interface to a variety of machine learning algorithms, with extensive support to help explain functionality and shorten the time to results.

If you look forward 12 to 18 months, what types of innovations will your system deliver to its licensees?

We’ve got a feature list a mile long, and we prioritize much of the development based on client feedback to ensure we are addressing their immediate needs. Some of the more advanced features planned for development include advanced entity-resolution methods, entity deconfliction, automatic schema and content matching, data quality transformations, delivery of pre-encoded workflows for specific domains and data sets, automatic classification models using machine learning, and a number of new visualizations, report formats, and output specifications. We’re also looking at novel methods for collaborating on an investigation or case with automatic markers, updates, and notifications. Plus, there are a lot more apps being added to the App Center, including connections to Lexis/Nexis, Thomson Reuters, TransUnion, DarkOwl, Anno.Ai, Dataiku, and many more. Our goal is to deliver a positive experience to help the user make confident and well-informed decisions.

There are calls for defunding law enforcement, intelligence, and the military in the US. What’s the outlook for policeware and intelware for investigators in government and non-governmental organizations?

The outlook is very positive. The cliché “doing more with less” is apropos for the current marketplace. Under normal circumstances you really don’t need to hire new people; you just need to refocus and refactor the available resources. Organizations already have access to a lot of data. It’s a matter of using it better, smarter, and more effectively.

For example, during the recent riots in Philadelphia, a woman torched a police car. The FBI reviewed videos of the incident uploaded to Instagram and Vimeo, saw that she had a distinctive tattoo on her forearm, and discovered that her shirt was sold only by a specific Etsy store where she had posted a review. Using the name in the online review, they identified a matching Poshmark profile and, from there, found that name referenced in a LinkedIn profile which exposed her true identity and showed the same tattoo.

Thus, platforms such as DataWalk, designed to achieve efficiencies with the data, will be well received. They’ll be quickly adopted because they are less expensive than traditional systems, they deliver better and faster results, they are easy to learn, and they’re extensible to keep up with new technology advances. Net-net, don’t take a knife to a gun fight.

How can a person interested in DataWalk contact you?

Take a look at our website http://www.datawalk.com and review the write-ups and videos. Request additional materials from info @ datawalk.com or you can contact me directly at: chris.westphal @ datawalk.com

DarkCyber Observations

DataWalk’s approach provides the analytic power of industry-standard services. The firm’s interface, its approach to customers and licensees, and its commitment to providing a system which outputs understandable results set it apart. The firm’s technology has captured customers in the US Department of Justice, financial institutions, and organizations outside the US. This is an important player in the policeware and intelware sector.

Stephen E Arnold, October 21, 2020

Stephen E Arnold is the publisher of “Dark Cyber Annex” and producer of DarkCyber, a weekly video news program for law enforcement and intelligence professionals. Access these information sources at www.arnoldit.com/wordpress.