Intelligent Tagging Makes Unstructured Data Usable

March 20, 2020

We are not going to talk about indexing accuracy. Just keep that idea in mind, please.

Unstructured data is a nightmare nobody wants to handle. Within a giant unstructured mess, however, is usable information. How do you get to the golden information? There are multiple digital solutions, software applications, and big data tools that are supposed to get the job done. It raises another question: which tool do you choose? Among these choices is Intelligent Tagging from Refinitiv.

What is “intelligent tagging?”

“Intelligent Tagging uses natural language processing, text analytics and data-mining technologies to derive meaning from vast amounts of unstructured content. It’s the fastest, easiest and most accurate way to tag the people, places, facts and events in your data, and then assign financial topics and themes to increase your content’s value, accessibility and interoperability. Connecting your data consistently with Intelligent Tagging helps you to search smarter, personalize content recommendations and generate alpha.”

Intelligent Tagging can read through gigabytes of different textual information (emails, texts, notes, etc.) using natural language processing. The software structures data by assigning it tags and then forming connections within the content. Once the information is organized, search can quickly locate the desired information. Content can be organized in a variety of ways, such as by company, person, location, topic, and more. Relevancy scores are added to show how relevant a search indicator is to the search results. Intelligent Tagging also updates itself in real time by paying attention to the news and adding new metadata tags.
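To make the tag-then-score idea concrete, here is a minimal sketch in Python. It is not Refinitiv's API or method. The gazetteer, the function name `tag_text`, and the relevance formula (an entity's share of all entity mentions) are assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical gazetteer: surface form -> (canonical name, tag type).
GAZETTEER = {
    "refinitiv": ("Refinitiv", "Company"),
    "huawei": ("Huawei", "Company"),
    "new hampshire": ("New Hampshire", "Location"),
}

def tag_text(text):
    """Return {canonical name: (tag type, relevance)} for known entities in text."""
    lowered = text.lower()
    counts = Counter()
    for surface, (name, _kind) in GAZETTEER.items():
        # Whole-word match so a surface form does not fire inside a longer token.
        hits = re.findall(r"\b" + re.escape(surface) + r"\b", lowered)
        if hits:
            counts[name] += len(hits)
    total = sum(counts.values())
    kinds = {name: kind for name, kind in GAZETTEER.values()}
    # Assumed scoring: relevance is each entity's share of all entity mentions.
    return {name: (kinds[name], n / total) for name, n in counts.items()}
```

Calling `tag_text` on a short note returns each recognized entity with its type and a score between 0 and 1. A production tagger would rely on trained models rather than a fixed lookup table.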

It is an optimized search experience and yields more powerful results in less time than similar software.

Intelligent Tagging offers a necessary service, but the only way to see if it promises to bring structure to data piles is to test it out.

Whitney Grace, March 20, 2020

Open RAN to Run Down Huawei

March 20, 2020

China’s top networking and telecom company Huawei is poised to dominate the world’s 5G wireless network, and we’re told the CIA took matters into its own hands years ago. The Washington Times reports, “CIA Funnels Cash to Private Company Aimed at Defeating Huawei.” Instead of waiting for Congress, the Justice Department, the Pentagon, and the White House to agree on a path, the CIA contracted venture capital fund In-Q-Tel to find a solution. Reporter Ryan Lovelace writes:

“Christopher Darby, the president and CEO of In-Q-Tel, told the House Permanent Select Committee on Intelligence last month that the venture capital fund began investing in 5G technology seven years ago. He said in a hearing that his fund has identified Parallel Wireless, a telecommunications service provider based in New Hampshire, as part of a government solution to concerns about Huawei’s threats to national security. Parallel Wireless uses a software-centered approach for building radio access network (RAN) capability that would ‘eliminate the need to spend millions of dollars on new equipment and infrastructure upgrades,’ according to a document on the company’s website. Parallel Wireless says its ‘Open RAN,’ which requires minimum maintenance, is ready for deployment immediately. Steve Papa, co-founder, chairman and CEO, said Parallel Wireless is fortunate that In-Q-Tel is proactively fighting Chinese domination of telecommunications. ‘Parallel Wireless is committed to destroying the threat Huawei poses to the free world,’ Mr. Papa said in an email. ‘We are actively working to ensure America and the world are free from the constraints of Huawei. Parallel works with many companies, many governments and many government agencies including In-Q-Tel.’”

Some in positions of power, like the attorney general, remain unconvinced Open RAN is a viable solution. Others are critical of In-Q-Tel itself. The fund was launched in 1999 as an investment firm for the intelligence community, which naturally means its investments were made with taxpayer dollars. Yet top-earning employees at the company have profited greatly from the fund’s success—several with annual salaries greater than $500,000 and with Darby himself making over $1.6 million in 2017 alone.

Be that as it may, the salaries are beside the point of whether Parallel Wireless is indeed the answer to our Huawei problem. Will the CIA convince the rest of the federal government this is the solution?

Cynthia Murrell, March 20, 2020

Semantic Search Allegedly Adds A Boost To Product Discovery

March 20, 2020

Semantic search is one of the old reliable pieces of jargon for improving a search application, but it appears to be old hat. Semantic search, however, can, when correctly implemented, add a much needed boost for product discovery.

Grid Dynamics explains semantic magic in the article, “Boosting Product Discovery With Semantic Search.” We all know that human language is a complicated beast, which is why it has taken decades to develop decent voice-to-text and automated foreign language translation algorithms.

Humans learn from infancy to process speech based on context and life experience. As technology has progressed, search engines are expected to perform the same actions, which is where semantic search enters the game. Semantic search not only matches key words and phrases, but it brings meaning to them. Ecommerce Web sites require more than keyword and phrase search. Customers want to sort products based on price, brands, ratings, etc.

I am a librarian, and I know that irrelevant results often appear in any search and there are two types of these results: Obviously irrelevant values and values with subtle differences. A simple solution does not exist to fix all the irrelevant results.

Solutions are usually built as a hybrid of semantic search and unstructured data. The semantic search part has several requirements: single words that belong to multi-word phrases must be treated as unbreakable units, business domain knowledge must retract or enhance query options, and ambiguous matches need to be resolved with saliency to match attributes. Boolean queries also can be implemented in new ways to alter searches. Semantic search can also be used with different physical properties and merchandising rules.
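The "unbreakable multi-word phrase" requirement can be sketched as a greedy longest-match pass over the query. This is a toy illustration, not Grid Dynamics' implementation. The vocabulary, the attribute names, and the function `parse_query` are all hypothetical.

```python
# Hypothetical domain vocabulary: phrase -> (attribute, value).
# Multi-word phrases sit beside their component words so the longer match can win.
VOCAB = {
    "dress shirt": ("product_type", "dress shirt"),
    "dress": ("product_type", "dress"),
    "shirt": ("product_type", "shirt"),
    "red": ("color", "red"),
    "calvin klein": ("brand", "Calvin Klein"),
}
MAX_PHRASE_WORDS = max(len(phrase.split()) for phrase in VOCAB)

def parse_query(query):
    """Greedy longest-match tagging of a query against VOCAB.

    Returns (matched attribute pairs, words with no vocabulary entry).
    """
    words = query.lower().split()
    matched, leftover = [], []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shrink.
        for size in range(min(MAX_PHRASE_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + size])
            if phrase in VOCAB:
                matched.append(VOCAB[phrase])
                i += size
                break
        else:
            # Nothing in the vocabulary starts at this word.
            leftover.append(words[i])
            i += 1
    return matched, leftover
```

With this vocabulary, "dress shirt" is tagged as one product type instead of being split into "dress" and "shirt". Words that match nothing are returned separately so the caller can decide how to handle them.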

Semantic search is a powerful tool for ecommerce Web sites, but:

“However, the power of semantic search largely depends on the richness and quality of the domain data – product attribution as well as synonyms. If your customers often perform out-of-dictionary search, then semantic search quality will suffer. It can include

• searches by subjective features like occasion of clothing (church dress) or age group for hi-tech device (laptops for kids)

• searches for brands which aren’t carried by your site, but it has similar products which can be suggested instead of just dropping the brand value from a query”

Never doubt how semantic search can improve an ecommerce search engine, but be sure to instill proper parameters for it to work correctly. Semantic search will remain a favorite of marketing whether a system is helping the person looking for information or hindering relevancy.

Whitney Grace, March 20, 2020

Wolfram Mathematica

March 19, 2020

DarkCyber noted “In Less Than a Year, So Much New: Launching Version 12.1 of Wolfram Language & Mathematica” contains highly suggestive information. Yes, this is a mathy program. The innovations are significant for analysts and some government professionals. To cite one example:

I’ve been recording hundreds of hours of video in connection with a new project I’m working on. So I decided to try our new capabilities on it. It’s spectacular! I could take a 4-hour video, and immediately extract a bunch of sample frames from it, and then—yes, in a few hours of CPU time—“summarize the whole video”, using SpeechRecognize to do speech-to-text on everything that was said and then generating a word cloud…

DarkCyber reacts positively to other additions and enhancements to the Mathematica “system.” Version 12.1 will make it easier to develop specific functions for policeware and intelware use cases.

Remarkable because the “system” can geo-everything. That’s important in many situations.

Stephen E Arnold, March 19, 2020

As Google Relies More on Its Smart Software, Smart Software Sells Protective Masks. Really?

March 19, 2020

DarkCyber noted “Senators Blast Google For Facemask Ads Amid Coronavirus, Demand FTC Action.” The senators are Mark Warner of Virginia and Richard Blumenthal of Connecticut.

What agitated these luminaries? The write up reports:

…despite Google announcing a ban on ads for protective facemasks last week, their staff were easily able to find Google ads for facemasks over the past week.

Who blew the whistle on Google’s smart software and ad serving machine?

The write up reports:

The senators told the FTC, “our staffs were consistently served dozens of ads for protective masks and hand sanitizer,” often when browsing news stories about the coronavirus.

DarkCyber thought big contributors and lobbyists were best positioned to pass information to these stalwarts of democracy.

The write up further offers this factoid:

“These ads, from a range of different advertisers, were served by Google on websites for outlets such as The New York Times, The Boston Globe, The Washington Post, CNBC, The Irish Times, and myriad local broadcasting affiliates,” the senators told the FTC. “Google has made repeated representations to consumers that its policies prohibit ads for products such as protective masks. Yet the company appears not to be taking even rudimentary steps to enforce that policy,” they added.

Perhaps the humans at Google agreed to stop these ads. However, the memo may not have been processed by the smart ad sales system. Latency happens.

Some humans with knowledge of the offending module appear to have implemented a fix. (DarkCyber thought that Google’s code was not easily modified.) Objectivity, relevance, and maybe revenue.

We were not able to get Google to display surgical mask ads as of 0947 Eastern on March 18, 2020. Progress and evidence that Google can control some of what appears in search results pages. Contradiction? Nope, just great software, managers, and engineers.

Stephen E Arnold, March 19, 2020

A New Horizon for Verizon: Swizzled Search Results

March 19, 2020

DarkCyber read “Yahoo, AOL, OneSearch Results Biased in Favor of Parent Company Verizon Media’s Web Sites.” The main idea seems to be that, as with bakers in 11th century France, a thumb on the scales could pay dividends. A gram here, a gram there.

The article asserts:

You may not be surprised to learn that the search results from all three of Verizon Media’s search engines are biased in favor of Verizon Media websites. Yahoo!, AOL, and OneSearch all boosts the ranking of Verizon Media brands in organic search results. That is to say, regular web results excluding ads, news, shopping, image, and video search results.

Surprised? Nope. The revelatory factoid is that Bing indexes the Verizon content. Neither Bing nor Google reveals exactly how many Web sites their respective systems index. Useless information like how many links the crawlers follow in a Web site is not made explicit.

DarkCyber’s test queries suggest that Bing indexes only sites with a higher probability of being clicked. We have noted that for some queries, the Bing results closely parallel Google’s. Bing search administrators, are you monitoring Mother Google?

Therefore, it is such a happy coincidence that Bing indexes the Verizon-owned sites and displays them in a favorable position. In the good old days, the approach was called hit boosting. Today it probably has the words artificial intelligence and semantic technology obfuscating the shaping of content to meet a specific business need.

Progress in search? Absolutely. Just search engine optimization, however.

Stephen E Arnold, March 19, 2020

 

https://www.ctrl.blog/entry/verizon-media-search.html

Israel and Mobile Phone Data: Some Hypotheticals

March 19, 2020

DarkCyber spotted a story in the New York Times: “Israel Looks to Repurpose a Trove of Cell Phone Data.” The story appeared in the dead tree edition on March 17, 2020, and you can access the online version of the write up at this link.

The write up reports:

Prime Minister Benjamin Netanyahu of Israel authorized the country’s internal security agency to tap into a vast, previously undisclosed trove of cell phone data to retrace the movements of people who have contracted the coronavirus and identify others who should be quarantined because their paths crossed.

Okay, cell phone data. Track people. Paths crossed. So what?

Apparently not much.

The Gray Lady does the handwaving about privacy and the fragility of democracy in Israel. There’s a quote about the need for oversight when certain specialized data are retained and then made available for analysis. Standard journalism stuff.

DarkCyber’s team talked about the write up and what the real journalists left out of the story. Remember. DarkCyber operates from a hollow in rural Kentucky and knows zero about Israel’s data collection realities. Nevertheless, my team was able to identify some interesting use cases.

Let’s look at a couple and conclude with a handful of observations.

First, the idea of retaining cell phone data is not exactly a new one. What if these data can be extracted using an identifier for a person of interest? What if a time-series query could extract the geolocation data for each movement of the person of interest captured by a cell tower? What if this path could be displayed on a map? Here’s a dummy example of what the plot for a single person of interest might look like. Please note that these graphics are examples selected from open sources. The examples are not related to a single investigation or vendor. They are for illustrative purposes only.

image

Source: Standard mobile phone tracking within a geofence. Map with blue lines showing a person’s path. SPIE at https://bit.ly/2TXPBby

Useful indeed.

Second, what if the intersection of the paths of two or more individuals can be plotted? Here’s a simulation of such a path intersection:

image

Source: Map showing the location of a person’s mobile phone over a period of time. Tyler Bell at https://bit.ly/2IVqf7y

Would these data provide a way to identify an individual with a mobile phone who was in “contact” with a person of interest? Would the authorities be able to perform additional analyses to determine who is in either party’s social network?
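The two hypotheticals so far, a time-window query for one identifier and a check for where two paths cross, can be sketched in a few lines of Python. Everything here is made up for illustration: the record layout, the tower names, the 15-minute window, and the functions `path_for` and `crossings`. It says nothing about how any real system stores or queries such data.

```python
from datetime import datetime, timedelta

# Made-up records: (subscriber id, tower id, timestamp). Not real data.
PINGS = [
    ("A", "tower-1", datetime(2020, 3, 17, 9, 0)),
    ("A", "tower-2", datetime(2020, 3, 17, 9, 30)),
    ("A", "tower-3", datetime(2020, 3, 17, 10, 15)),
    ("B", "tower-2", datetime(2020, 3, 17, 9, 40)),
    ("B", "tower-3", datetime(2020, 3, 17, 13, 0)),
]

def path_for(pings, subscriber, start, end):
    """Time-ordered (timestamp, tower) list for one subscriber in [start, end]."""
    rows = [(ts, tower) for sid, tower, ts in pings
            if sid == subscriber and start <= ts <= end]
    return sorted(rows)

def crossings(pings, first, second, window=timedelta(minutes=15)):
    """Towers where both subscribers were seen within `window` of each other."""
    hits = []
    for sid_a, tower_a, ts_a in pings:
        if sid_a != first:
            continue
        for sid_b, tower_b, ts_b in pings:
            if sid_b == second and tower_a == tower_b and abs(ts_a - ts_b) <= window:
                hits.append((tower_a, ts_a, ts_b))
    return hits
```

In the sample data, subscribers A and B both touch tower-2 ten minutes apart, so that pair is reported. Their tower-3 sightings are hours apart and are ignored. The nested loop is fine for a toy list; a real data set would need an index on tower and time.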

Third, could these relationship data be mined so that connections can be further explored?

image

Source: Diagram of people who have crossed paths visualized via Analyst Notebook functions. Globalconservation.org

Can these data be arrayed on a timeline? Can the routes be converted into an animation that shows a particular person of interest’s movements at a specific window of time?

image

Source: Vertical dots diagram from Recorded Future showing events on a timeline. https://bit.ly/39Xhbex

These hypothetical displays of data derived from cross correlations, geotagging, and timeline generation based on date stamps seem feasible. If earnest individuals in rural Kentucky can see the value of these “secret” data disclosed in the New York Times’ article, why didn’t the journalist and the others who presumably read the story?

What’s interesting is that systems, methods, and tools clearly disclosed in open source information are overlooked, ignored, or just not understood.

Now the big question: Do other countries have these “secret” troves of data?

DarkCyber does not know; however, it seems possible. Log files are a useful function of data processes. Data exhaust may have value.

Stephen E Arnold, March 19, 2020

Startup Gretel Building Anonymized Data Platform

March 19, 2020

There is a lot of valuable but sensitive data out there that developers and engineers would love to get their innovative hands on, but it is difficult to impossible for them to access. Until now.

Enter Gretel, a startup working to anonymize confidential data. We learn about the upcoming platform from Inventiva’s article, “A Group of Ex-NSA And Amazon Engineers Are Building a ‘GitHub for Data’.” Co-founders Alex Watson, John Myers, Ali Golshan, and Laszlo Bock were inspired by the source code sharing platform GitHub. Reporter surbhi writes:

“Often, developers don’t need full access to a bank of user data — they just need a portion or a sample to work with. In many cases, developers could suffice with data that looks like real user data. … ‘We’re building right now software that enables developers to automatically check out an anonymized version of the data set,’ said Watson. This so-called ‘synthetic data’ is essentially artificial data that looks and works just like regular sensitive user data. Gretel uses machine learning to categorize the data — like names, addresses and other customer identifiers — and classify as many labels to the data as possible. Once that data is labeled, it can be applied access policies. Then, the platform applies differential privacy — a technique used to anonymize vast amounts of data — so that it’s no longer tied to customer information. ‘It’s an entirely fake data set that was generated by machine learning,’ said Watson.”
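The article does not say how Gretel implements differential privacy. For readers who want a feel for the technique, here is a minimal sketch of the textbook Laplace mechanism applied to a simple count. It is not Gretel's method, and the function name `private_count` and its parameters are assumptions for illustration.

```python
import random

def private_count(records, predicate, epsilon, rng=random):
    """Count matching records, then add Laplace noise with scale 1 / epsilon."""
    true_count = sum(1 for record in records if predicate(record))
    # A counting query changes by at most 1 when one record is added or removed,
    # so its sensitivity is 1 and the Laplace scale is 1 / epsilon.
    rate = epsilon
    # The difference of two independent exponential draws with the same rate
    # is a Laplace-distributed value centered on zero.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise
```

A smaller `epsilon` means more noise and stronger privacy. A larger `epsilon` means answers closer to the true count. Generating a whole synthetic data set, as the quote describes, is a much larger job than this single noisy query.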

The founders are not the only ones who see the merit in this idea; so far, the startup has raised $3.5 million in seed funding. Gretel plans to charge users based on consumption, and the team hopes to make the platform available within the next six months.

Cynthia Murrell, March 19, 2020

Machine Learning Foibles: Are We Surprised? Nope

March 18, 2020

Eurekalert published “Study Shows Widely Used Machine Learning Methods Don’t Work As Claimed.” Imagine that? The article states:

Researchers demonstrated the mathematical impossibility of representing social networks and other complex networks using popular methods of low dimensional embeddings.

To put the allegations and maybe mathematical proof in context, there are many machine learning methods and even more magical thresholds the data whiz kids fiddle to generate acceptable outputs. The idea is that as long as the outputs are “good enough”, the training method is okay to use. Statistics is just math with some good old fashioned “thumb on the scale” opportunities.

The article states:

The study evaluated techniques known as “low-dimensional embeddings,” which are commonly used as input to machine learning models. This is an active area of research, with new embedding methods being developed at a rapid pace. But Seshadhri and his coauthors say all these methods share the same shortcomings.

What are the shortcomings?

Seshadhri and his coauthors demonstrated mathematically that significant structural aspects of complex networks are lost in this embedding process. They also confirmed this result empirically by testing various embedding techniques on different kinds of complex networks.

The method discards or ignores information, relying on a fuzz ball which puts an individual into a “geometric representation.” Individuals’ social connections are lost in the fuzzification procedures.

Big deal. Sort of. The paper opens the door to many graduate students’ beavering away on the “accuracy” of machine learning procedures.

Stephen E Arnold, March 18, 2020

The Google: Geofence Misdirection a Consequence of Good Enough Analytics?

March 18, 2020

What a surprise—the use of Google tracking data by police nearly led to a false arrest, we’re told in the NBC News article, “Google Tracked his Bike Ride Past a Burglarized Home. That Made him a Suspect.” Last January, programmer and recreational cyclist Zachary McCoy received an email from Google informing him, as it does, that the cops had demanded information from his account. He had one week to try to block the release in court, yet McCoy had no idea what prompted the warrant. Writer Jon Schuppe reports:

“There was one clue. In the notice from Google was a case number. McCoy searched for it on the Gainesville Police Department’s website, and found a one-page investigation report on the burglary of an elderly woman’s home 10 months earlier. The crime had occurred less than a mile from the home that McCoy … shared with two others. Now McCoy was even more panicked and confused.”

After hearing of his plight, McCoy’s parents sprang for an attorney:

“The lawyer, Caleb Kenyon, dug around and learned that the notice had been prompted by a ‘geofence warrant,’ a police surveillance tool that casts a virtual dragnet over crime scenes, sweeping up Google location data — drawn from users’ GPS, Bluetooth, Wi-Fi and cellular connections — from everyone nearby. The warrants, which have increased dramatically in the past two years, can help police find potential suspects when they have no leads. They also scoop up data from people who have nothing to do with the crime, often without their knowing — which Google itself has described as ‘a significant incursion on privacy.’ Still confused — and very worried — McCoy examined his phone. An avid biker, he used an exercise-tracking app, RunKeeper, to record his rides.”

Aha! There was the source of the “suspicious” data—RunKeeper tapped into his Android phone’s location service and fed that information to Google. The records show that, on the day of the break-in, his exercise route had taken him past the victim’s house three times in an hour. Eventually, the lawyer was able to convince the police his client (still not unmasked by Google) was not the burglar. Perhaps ironically, it was RunKeeper data showing he had been biking past the victim’s house for months, not just proximate to the burglary, that removed suspicion.
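A geofence check of this kind comes down to a distance test against a center point, repeated over a track of location fixes. The sketch below is a toy illustration, not how Google or any police tool works. The radius, the coordinates in the usage note, and the function names are made up.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def count_passes(track, center, radius_m):
    """Count how many separate times a time-ordered track enters the fence."""
    passes, inside = 0, False
    for lat, lon in track:
        now_inside = haversine_m(lat, lon, *center) <= radius_m
        # Only a transition from outside to inside counts as a new pass.
        if now_inside and not inside:
            passes += 1
        inside = now_inside
    return passes
```

Feed `count_passes` a looping ride that comes within the radius of a made-up center such as `(10.0, 20.0)` three times and it reports three passes. That is the kind of pattern that looks suspicious until months of similar rides put it in context.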

Luck, and a good lawyer, were on McCoy’s side, but the larger civil rights issue looms. Though such tracking data is anonymized until law enforcement finds something “suspicious,” this case illustrates how easy it can be to attract that attention. Do geofence warrants violate our protections against unreasonable searches? See the article for more discussion.

Cynthia Murrell, March 18, 2020
