Is it possible to extract the actual resolved destination of URL-shortener links from Solr?

stembod · 30 December 2020 15:55

As time has shown, URL-shortener services come along and go away just as frequently, leading to ‘link-rot’ when they eventually fizzle out…

So, i’ve been thinking^0.5 and drinking^10, and am now wondering if, just maybe, there might be some query magic in the YaCy Solr API, combined with some custom-written util on the side to query and parse it, that could be used to find out where the shortened URLs YaCy came across during its crawls actually resolved to?

Potentially killing 2 birds with one stone

Is there data in a YaCy Solr that could potentially be used for such a thing?

https://tracker.archiveteam.org:1338/ (archiveteam.org 's ongoing effort)

Regards, a Solr-n00b

Orbiter · 31 December 2020 16:25

Well, storing the original URLs behind shortened URLs is a good idea. But this is completely independent of Solr, which is just an indexer, not any kind of do-something-magic-with-URLs tool.

So what we would need to do is:

1. find out whether any URL which was discovered anywhere is a shortened URL or not
2. in case it is shortened, unshorten it
3. store the unshortened link together with the original link, OR store only the unshortened link instead of the shortened one and replace every occurrence anywhere (also in documents) with the unshortened one
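
The three steps above can be sketched roughly as follows. This is only an illustration in Python, not YaCy code: the host list, the function names and the record layout are all made up for the example, and the actual unshortening is passed in as a callable so the sketch stays independent of how it is done.

```python
from urllib.parse import urlsplit

# Hypothetical, incomplete list of shortener hosts; a real list would need upkeep.
SHORTENER_HOSTS = {"goo.gl", "bit.ly", "t.co", "tinyurl.com"}


def is_shortened(url):
    """Step 1: guess whether a URL belongs to a known shortener by its host."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENER_HOSTS


def annotate(urls, unshorten):
    """Steps 2 and 3: unshorten where needed and keep both forms together.

    `unshorten` is any callable mapping a short URL to its target, or to
    None if the target could not be determined.
    """
    records = []
    for url in urls:
        target = unshorten(url) if is_shortened(url) else None
        records.append({"original": url, "unshortened": target})
    return records
```

Keeping both forms side by side corresponds to the first storage option in step 3; replacing the short link everywhere would be a separate and more invasive change.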

This would be something that should be done in the parser or after the parser has finished. We already have a component which does “semantic enrichment”, that is, additional annotation like finding synonyms. That component would have to take care of the unshortening.

Doing unshortening is actually easy, I implemented that in kaskelix (kaskelix.org) where all links that appear in tweets are actually shortened. Twitter is doing shortening all the time and sometimes de-shortened links are shortened as well (with another shortener).
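
A minimal sketch of what such unshortening could look like, assuming the shortener answers with an ordinary HTTP redirect (a 3xx status with a `Location` header). This is not the kaskelix code; `head_location` and `unshorten` are names invented here, and some services may not answer `HEAD` requests the same way they answer `GET`.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def head_location(url, timeout=10):
    """Send one HEAD request; return the Location header of a redirect, else None."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        if resp.status in REDIRECT_CODES:
            return resp.getheader("Location")
        return None
    finally:
        conn.close()


def unshorten(url, fetch_location=head_location, max_hops=10):
    """Follow redirects hop by hop; return every URL visited, the last being final."""
    chain = [url]
    for _ in range(max_hops):
        location = fetch_location(chain[-1])
        if not location:
            break
        nxt = urljoin(chain[-1], location)  # Location may be relative
        if nxt in chain:  # stop on a redirect loop
            break
        chain.append(nxt)
    return chain
```

Returning the whole chain rather than only the last URL covers the case where a shortened link points at another shortener, and it keeps the intermediate hops available if they are wanted later.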

Well, all together this is a big feature, but we can put it on the list…

Tom_Booth · 1 January 2021 00:05

Perhaps not particularly related to the topic, but various services/platforms seem to be using “shortening” or some other form of substitution, sometimes lengthening… apparently for the purpose of controlling access to links posted as a form of “safeguard”, to “protect” platform members from “inappropriate” or potentially “dangerous” content that might lead to wrong think or “misleading” news articles, and no doubt for tracking and advertising purposes, data collection etc.

Generally, these “redirects” have no content of their own, so I’m not sure what YaCy might want to index anyway, but Google seems to sometimes, if not often or always, display a website within a frame of some sort, so the URL in the nav bar is actually something like https-googles-url-long-string-of-gibberish-:http:-actual-website-url-more-jibberish…

This extended, bracketed, or frame-encapsulated URL is another case.

I’ve often had to strip off the surrounding garbage when trying to copy/paste links to a forum, so as to bypass Google’s apparent tracking/hijacking (or whatever one wishes to call it) of posted links, which may subsequently spread like a virus as people share the link to some news story or whatever.

There again, I’m not sure Google’s “frame” or whatever it is bracketing the actual url has any REAL content to index.

This sort of link hijacking/redirection/tracking seems to be getting more and more prevalent, whether it involves apparent shortening or lengthening of the actual URL.

I personally have been infuriated when I post a link to share, then go back and find that twitter/facebook/youtube/google/whomever has programmatically inserted some link that is not the link I posted but some filter or redirection or tracking.

So, I guess the question is, how should, or how might, or how does YaCy handle such links.

I’m also curious.

I know there are services like tinyurl that may be in a different category, that people intentionally use, for one reason or another. Again, though, not THE REAL url to the actual resource.

I tend to think of all such url impostering as just so much garbage and internet congestion blocking the free flow of information and should probably just be stripped away and discarded if possible.

The entire domain name system itself is a kind of redirect on top of the actual IP address.

Perhaps a solution might be to discard any and all links that do not resolve to an actual ip address, though I cannot offer anything regarding how that could be implemented.

I don’t think YaCy should in any way, directly or indirectly, support such third party tracking and data mining by indexing links that would redirect traffic through such a link shortening “service”, regardless of how seemingly innocuous.

After reading through these policies, I’m not sure there are any that could be considered entirely innocuous. Sorry to say.

https://tinyurl.com/app/privacy-policy

stembod · 1 January 2021 12:34

@Tom_Booth @Orbiter

I don’t think YaCy should in any way, directly or indirectly, support such third party tracking and data mining by indexing links that would redirect traffic through such a link shortening “service”, regardless of how seemingly innocuous.

I agree wholeheartedly. YaCy should be focused on doing its own thing, and i don’t think this should be an added feature in YaCy. At most, perhaps, an optional plug-in of sorts.

I was mainly wondering if it would be possible, using another, separate (not yet existing) program, to extract such info from a YaCy Solr, or several of those.

E.g. i see that /solr/select?core=collection1&q=host_s:goo.gl&start=0&rows=100 lists info such as <str name="sku">http://goo.gl/RQsj6z</str>, which might even be useful for their effort of just discovering the RQsj6z part of a shortened URL (in this case goo.gl). I’m not sure.
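
A rough sketch of what such a separate program could look like in Python, built only on the query and the `sku` field quoted above. The peer address `http://localhost:8090` is an assumption about a default local install, the function names are invented, and the exact shape of the XML response beyond `<str name="sku">` is assumed as well. Note that this only lists the short URLs a peer has indexed; whether the index also holds what they resolved to is exactly the open question.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_query(base, host, start=0, rows=100):
    """Build the select URL from the post for one shortener host."""
    params = {"core": "collection1", "q": "host_s:" + host, "start": start, "rows": rows}
    return base.rstrip("/") + "/solr/select?" + urlencode(params)


def extract_sku(xml_data):
    """Collect the text of every <str name="sku"> element in a Solr XML response."""
    root = ET.fromstring(xml_data)
    return [el.text for el in root.iter("str") if el.get("name") == "sku"]


def short_urls_on_peer(base, host):
    """Query one peer and return the shortened URLs it has indexed for `host`."""
    with urllib.request.urlopen(build_query(base, host), timeout=30) as resp:
        return extract_sku(resp.read())


if __name__ == "__main__":
    # Assumed address of a local YaCy peer; adjust to your own setup.
    for url in short_urls_on_peer("http://localhost:8090", "goo.gl"):
        print(url)
```

Paging through more than 100 hits would mean repeating the query with a growing `start` value.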

Orbiter · 1 January 2021 13:03

Well about tracking: that does not work for tracker services, even if your crawler loads them!! Let me explain:

Tracking means that someone is able to observe what you are doing. In this case (for all embedded links, either tracking pixels or redirects through shorteners), the referrer which is submitted by the browser tells the tracking service which page you came from; it tells the service what URL you are looking at.
Such tracking is disconnected from you as a YaCy search engine user in two ways: a) the crawler does not submit the referrer! b) as a search engine user you are not submitting anything at all unless you actually click on the link; then the referrer is the current search page (which in the targeted p2p case is your localhost, so it means nothing to a tracker).

So please don’t mix up things. Following a shortened link in the crawler is no security or privacy breach.

Tom_Booth · 1 January 2021 14:51

I think if someone visits a domain, any domain, the server is able to record quite a bit of information. Some of that information is required or necessary for the internet to function. The server cannot send a redirect without knowing the IP address of the machine running the browser, unless someone is using Tor or whatever. Right? Maybe I’m wrong, but I’ve run programs on servers and this is very easy, even if just a one pixel transparent image is served for an instant from a redirect site. Cookies can be set, unless someone has cookies turned off. Browser, possibly location, etc. can be recorded, depending on the sophistication of the program running on the redirection service.

I’m not faulting YaCy. Far from it.

That is just the way the internet works. It is difficult to avoid.

But I was thinking of a url shortener/redirect/server as something like a turnstile at a train depot. If you pass through the turnstile then you can be tracked. But the turnstile is not the destination. There is a more direct way of getting there. Go directly to the destination without passing through the turnstile.

All I’m saying is that I think it would be better for an indexing spider to point directly to the actual destination or resource rather than the turnstile, which is just a detour through (according to their own “privacy policy”) a tracking site.

Am I not understanding something?

A YaCy user crawling the internet is not tracked, or that is not the issue I’m talking about.

I’m talking about once a resource is added to the index there is a search portal set up on the internet somewhere for people to use.

The search results that come up can either point directly to the actual resource or could, at least in theory, point to the shortened URL, which is, or would be, inadvertently sending the third party using the search portal to a tracking site, which would then redirect the browser to the actual resource.

It may be that YaCy already works in such a way as to avoid such a scenario, but then if so, why are we having this conversation?

stembod · 2 January 2021 01:02

I’m confused…

There is no tracking in resolving a shortened url, no more than there is tracking in DNS resolving…

it’s basically just akin to:

localhost -> 127.0.0.1
or
http://shitu.rl/immaBgonein2weeks -> https://archive.org/details/Whiskas-The_Falling_Log

or referrer i guess, in many cases.

…as i see it

That is to say, there’s undeniably quite a bit of tracking going on at the various shorturl service providers …

Tom_Booth · 2 January 2021 07:21

I’m not sure what you mean by “resolving a shortened url”.

As I understand how URL shortening works, the short URL is just a random string appended to the shortening service URL like http://micro.url/s6td5f6c ( <–no such site BTW )

The only way to “resolve” it is through the shortening service’s server, which associates the random string with the actual URL. I don’t think there is any public way to “resolve” or find out what the actual URL is, which is why, if one of these services goes away, the website, though it still exists on the internet, is no longer available through that shortened URL.

This can be, and is, currently used by many platforms as a form of censorship. They control access to the resource through their own shortening methodology and can therefore redirect howsoever they may choose, gather analytics etc.

I don’t know why anyone would WANT to use any such service other than scammers to obfuscate their real or actual URL in spam emails.

HTML has built-in “shortening” for webpages, like a “click here!” link.

stembod · 9 January 2021 00:00

My mistake. By resolving i simply spake in terms of e.g. DNS name resolving, i.e. how a domain/host name resolves to an actual IP address, and, in similar terms, how a shortened URL resolves to an actual webpage (URL).

E.g
In the DNS situation, ‘localhost’ commonly resolves to the IP address ‘127.0.0.1’.
And in shortener terms, e.g ‘http://micro.url/s6td5f6c’ could hypothetically resolve to URL ‘https://www.adomain.com/~someuser/a_sub_folder/some_other_folder/some_book/chapter_1/page1.htm’

But the process of figuring out what ‘http://micro.url/s6td5f6c’ actually is (what it points/links/redirects to) is what i meant when i said “resolving a shortened url”. There’s probably a more accurate term for it though.

Tom_Booth · 9 January 2021 16:26

Not a mistake. I mean, we agree on the definition of resolving.

Ordinarily, though, the DNS tables that do the mapping are so-to-speak “public domain”. Shared and propagated throughout the world wide DNS system.

As far as I know, this is not the case with shortened URLs. If the URL shortening service closes up shop, taking its servers offline, those forwarding tables are no longer available.

In theory, I suppose, if YaCy, in indexing sites using shortened URLs, retained this mapping information, it could possibly be retrieved, provided the site was crawled while the shortening service was still online.

If that were possible at all, I imagine it would be very hit and miss and would probably not involve “resolving” by the usual means.

stembod · 9 January 2021 20:55

Thank you. This is the essence of my initial question

@Orbiter And i think it is important to restate that my question is in no way a request for an added feature in YaCy itself; merely a question of whether the data is in a YaCy Solr, so that the info could potentially be pulled together by some other tool utilizing YaCy’s Solr(s).

YaCy should do what YaCy does best

Skål!

Tom_Booth · 9 January 2021 21:48

In any case where a webpage merely acts to forward the browser to another page, (possibly multiple times, shortened or not), I don’t know if YaCy retains any record of the intermediate “hops” taken.

A “normal” or common simple redirect would have a metadata tag for the redirect. I suspect a shortening service would likely use some more complicated script or program to effect the forwarding.
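
For the “metadata tag” kind of redirect, a minimal sketch of how a tool could pull the target out of a page is shown below. It assumes the usual `<meta http-equiv="refresh" content="0; url=...">` form; the class and function names are invented for the example. Redirects done at the HTTP level (a 3xx status with a `Location` header) or by script would not show up this way.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag seen."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # content looks like "0; url=https://example.org/page"
        _, _, rest = attr.get("content", "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value:
            self.target = value.strip().strip("'\"")


def meta_refresh_target(html):
    """Return the meta-refresh target URL found in `html`, or None."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.target
```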

As YaCy’s (or any search engine’s) primary purpose is to index the content of an actual website, I suspect there would ordinarily be no reason to retain information regarding the “hops” taken to get there.

I take it from Orbiter’s response that it could be done, but as yet, has not been implemented.