How to Boost via domain extension?

BeZazz · 2 November 2024 01:06

I have spent a lot of time trying to Boost results for .au domains.

The ChatGPT bot seems to be full of old information, as nothing it told me to do worked. The settings to be changed and the files to be altered did not exist.
Creating the files and making the changes had no effect.

Does anyone know a way to boost results from a certain domain extension (or extensions)?

Thank you.

BeZazz · 2 November 2024 06:25

I give up.
There is no way to get help with YACY.
Developers just assume you know what everything is. It’s like they created it and know what it does and that’s good enough.

The AI, while good for basic stuff, is not good for anything else, as it seems to be stuck about a decade ago, wanting you to edit stuff that no longer exists.

Currently stuck with curl 401ing all the time, no matter what I try, so I can’t advance on the original issue.
Searching for an answer here on the forum turns up an unanswered question from 2019 (most here seem to be self-replied) about the curl 401, still waiting for help and advice.

It’s a shame that the docs and help are so lacking. It could/should be a thriving community with a great piece of software.

EDIT: API endpoints always respond with 401 Unauthorized · Issue #354 · yacy/yacy_search_server · GitHub. Looks like it’s a feature that it doesn’t work as stated.
If you don’t know how to, there is no real explanation of how, just of how normal usage doesn’t work…
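
One possible cause of the constant 401s, offered as an assumption rather than a confirmed diagnosis: YaCy's admin-protected pages typically expect HTTP Digest authentication, which curl does not use unless asked to (for example with `--digest` or `--anyauth`). Below is a minimal sketch of the same idea in Python's standard library. `make_digest_opener` is a hypothetical helper name, and the address and credentials are placeholders.

```python
import urllib.request


def make_digest_opener(base_url, user, password):
    """Return an opener that answers 401 challenges with HTTP Digest auth."""
    # Register the credentials for any realm announced under base_url.
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, base_url, user, password)
    handler = urllib.request.HTTPDigestAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)
```

With a node running locally, something like `make_digest_opener("http://localhost:8090/", "admin", "your-password").open("http://localhost:8090/Crawler_p.html", timeout=10)` should then return the page instead of a 401, provided the account and password match what is configured on the node. The page name is only an example of an admin page.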

okybaca · 2 November 2024 10:50

Hi,
I understand your frustration.

It seems to me that the original developer, @Orbiter, is more into other projects (after these 18 years or so), so the project really lacks both maintenance and support. From time to time there is a commit or even a release of a new version, although not systematically.

The chatbot is surely fed with the old documentation, so we can’t expect miracles.

I try to write the documentation, but I’m a mere user, so a lot of things are a black box for me. I don’t do Java, so code contribution is mission impossible for me.

What we lack is:

Java developers,
a community helping others in the forum (I wonder why a lot of questions in the forum go unanswered, or why people come here only once).

Still, YaCy is the only working p2p search engine. The idea is great, but the implementation is still IMHO immature. For me, it’s still on the edge: is it worth the time and effort to contribute, or is it going to be abandoned?

There are at least three things everyone can do:

answer newbie or other user questions in the forum and/or GitHub issues. Lack of feedback is probably the most frustrating thing.
extract information already published and contribute it to the documentation or the documentation fork (which I personally run at: [YaCy Docs]),
if you understand Java, try to fix some issues or contribute new code.

I have experimented with a YaCy node for some years now; it is somehow working, but still not ripe for production use (mainly speed, memory and relevance issues for me).

It seems that we’ll have only as good a YaCy as we manage to make ourselves, not relying on the original authors.

okybaca · 2 November 2024 10:59

I personally don’t have a lot of experience with boosting the queries, but from what I understand, it’s a way to use Solr options directly in queries. Docs - Definition of Ranking Rules. If I had to deal with that, I’d try to play with some Solr queries directly, using the /solr/select?core=collection1&q=... interface. But frankly, I don’t have any more specific advice for you.
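
A minimal sketch of what playing with Solr queries directly could look like, assuming a local node on port 8090. Only the `/solr/select` path, `core=collection1` and the `host_s` field come from this thread. `solr_select_url` is a hypothetical helper, and the `rows`, `fl` and `wt` parameters and the `sku` and `title` field names are assumptions that may need adjusting to your node's schema.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10, fields="sku,host_s,title"):
    """Build a URL for the /solr/select interface of a YaCy node."""
    params = {
        "core": "collection1",  # core name as quoted in the post above
        "q": query,             # raw Solr query, e.g. host_s:myfoodbook.com.au
        "rows": rows,           # number of documents to return (assumed parameter)
        "fl": fields,           # fields to include per document (assumed names)
        "wt": "json",           # response format (assumed to be supported)
    }
    return f"{base_url.rstrip('/')}/solr/select?{urlencode(params)}"


# Example: list a few documents indexed for one host.
print(solr_select_url("http://localhost:8090", "host_s:myfoodbook.com.au", rows=5))
```

Opening the printed URL in a browser, or fetching it with curl, is a quick way to see which fields are actually populated before writing a boost rule against them.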

BeZazz · 2 November 2024 12:06

Thanks for your reply.

I know a bit of Java (I started programming in 1987, when I was 16).
I was going to look into development and contributing to the docs, but considering I can’t even gather enough information to boost a domain, I’m having second thoughts.

I have seen people wait a year for their changes to be submitted. It is a shame that it has been left to die a slow death.

If I can work out how to boost .au domains (part way there) I’ll hang around; otherwise, with the randomly bad search results and utter lack of any real help/docs (not you), I’ll give up and find something else to do with the server.

You appear to be the main/only active helper here.

I would like to hang around and contribute, guess I’ll see how I go in the next couple of hours and decide.

okybaca · 2 November 2024 13:06

very true. that led me to make my own fork of both yacy and docs, i try to commit the changes into mainstream, but my latest doc hangs in PRs for months again. better to not rely on central authorities.

i wish i weren’t the only one.
@joestr made some contributions in the previous year, @inonkps was interested in development, @sixcooler runs the 5 biggest yacy nodes and sometimes contributes code, @orbiter, the main author, most probably works in batches, from time to time, now probably working on solr 9 and some AI & vector search stuff. @roamn (.au based?) runs his wild but interesting performance experiments and stress tests. @akdk7 recently revived & crafted his yacy-stats.de site.
so it’s probably not dead, only… sleeping? fragmented? definitely not very actively maintained. but we could have forks.

BeZazz · 3 November 2024 02:14

While there is probably a better field to use, I made do with what was already enabled.
<str name="host_s">myfoodbook.com.au</str>

I did some brief testing and this seems to work for boosting specific domain extensions.
I placed the below in Boost Query:

host_s = query => *.au^50
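
One way to put a number on that kind of brief testing, as a rough sketch: run the same search before and after setting the boost, collect the result URLs, and compare what fraction of them have a host ending in `.au`. `suffix_share` is a hypothetical helper name, and apart from the host quoted above the URLs are made-up placeholders.

```python
from urllib.parse import urlsplit


def suffix_share(urls, suffix=".au"):
    """Return the fraction of URLs whose host name ends with the given suffix."""
    hosts = [(urlsplit(u).hostname or "").lower() for u in urls]
    if not hosts:
        return 0.0
    return sum(h.endswith(suffix) for h in hosts) / len(hosts)


# Placeholder result list: two of the four hosts end in .au.
results = [
    "https://myfoodbook.com.au/recipes",
    "https://example.com/",
    "http://news.example.au:8080/x",
    "https://example.org/au",  # ".au" only in the path, so not counted
]
print(suffix_share(results))  # prints 0.5
```

If the boost is doing its job, the share for the top results should go up noticeably after the change, at least for queries where `.au` pages exist in the index.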

okybaca · 3 November 2024 13:09

Glad you managed!

BeZazz · 3 November 2024 22:36

Thank you

okybaca · 13 November 2024 10:33

I’ve added that into the docs (PR). Hope that’s correct. If not, feel free to edit or extend.

BeZazz · 13 November 2024 12:50

That’s perfectly fine.
I do intend to get my notes (imagine scribble currently) together to pass on, but 5-6 weeks in I have zero indexes lol, for various (learning) reasons…

But I do have Python code (thanks, free ChatGPT) that takes YaCy’s top x results, then re-sorts them with spaCy.
spaCy reads the cached document and the pages are re-ranked.

okybaca · 8 December 2024 14:07

nice! would you share the code?

BeZazz · 8 December 2024 22:35

After Hugging Face and spaCy, I have been messing about with straight Python.
Hugging Face and spaCy went pretty well but are too complicated (for the average person) IMO due to possible dependency hell.
So I went with straight Python as it is easier for anyone to set up.

The code is slow and still being worked on, but it works.

http://bezazz.com/Archive.zip

EDIT: Would need to pip install
aiohttp
flask-caching
requests
beautifulsoup4
markupsafe
I recommend a venv: Creation of a virtual environment

uvicorn chatGPT:app --host 0.0.0.0 --port 5000

EDIT2: Here is where I stopped with spaCy. Above is Python only.

import requests
import spacy
from flask import Flask, render_template, request
from urllib.parse import quote_plus

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

# YaCy instance URL and search parameters
yacy_url = "http://localhost:8090/yacysearch.json"

app = Flask(__name__)

def rank_result_by_spacy(text):
    """Process the text using SpaCy and return a score based on entity recognition."""
    doc = nlp(text)
    entities = [ent for ent in doc.ents if ent.label_ in ["PERSON", "ORG", "GPE"]]  # Example of scoring based on entities
    return len(entities)

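def query_yacy(query):
    """Query YaCy and return the list of result items."""
    params = {
        "query": query,        # The search term entered by the user
        "maximumRecords": 5,    # Get top 5 results
        "contentdom": "text"    # Search only for text-based content
    }

    try:
        # A timeout keeps the page from hanging if the node is slow or down
        response = requests.get(yacy_url, params=params, timeout=10)
        response.raise_for_status()
        data = response.json()
        # The results are the "items" of the first channel, not the channels themselves
        channels = data.get("channels", [])
        return channels[0].get("items", []) if channels else []
    except requests.RequestException as e:
        print(f"Error querying YaCy: {e}")
        return []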

def get_cached_page(url):
    """Fetch the cached page from YaCy."""
    cached_url = f"http://localhost:8090/ViewCachedPage?url={quote_plus(url)}"
    try:
        response = requests.get(cached_url, timeout=10)  # avoid hanging on a slow node
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Error retrieving cached page for URL {url}: {e}")
        return None

@app.route("/", methods=["GET", "POST"])
def home():
    results_with_scores = []
    if request.method == "POST":
        query = request.form.get("query")
        if query:
            results = query_yacy(query)
            # Process and rank the results using SpaCy
            for result in results:
                page_title = result.get("title")
                page_url = result.get("link")

                # Step 1: Get the cached page content
                cached_content = get_cached_page(page_url)
                if cached_content:
                    # Step 2: Process the cached page content with SpaCy
                    score = rank_result_by_spacy(cached_content)
                else:
                    # If no cached content, set score to 0
                    score = 0

                results_with_scores.append((score, page_title, page_url))

            # Sort results by SpaCy score
            results_with_scores.sort(reverse=True, key=lambda x: x[0])

    return render_template("index.html", results=results_with_scores)

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=5000, debug=True)
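
One possible refinement, sketched under the assumption that the cached page comes back as raw HTML: strip the markup before handing the text to spaCy, so that tags, scripts and styles do not end up in the entity count. The sketch uses only the standard library. `html_to_text` and `_TextExtractor` are hypothetical names, and the `max_chars` cut-off is an arbitrary choice meant to keep the input well below spaCy's default length limit.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a script or style element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html, max_chars=100_000):
    """Return the visible text of an HTML page, truncated to max_chars."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)[:max_chars]
```

In the script above this could be applied as `rank_result_by_spacy(html_to_text(cached_content))`, leaving the rest of the flow unchanged.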

I just asked ChatGPT to generate a new index.html, as the original is long gone.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>YaCy Search</title>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/bootstrap/4.5.2/css/bootstrap.min.css">
    <style>
        body {
            padding-top: 50px;
        }
        .result {
            border-bottom: 1px solid #ddd;
            padding-bottom: 15px;
            margin-bottom: 15px;
        }
        .score {
            font-size: 1.2em;
            font-weight: bold;
        }
    </style>
</head>
<body>
    <div class="container">
        <h1 class="text-center">YaCy Search with SpaCy Ranking</h1>

        <!-- Search Form -->
        <form method="POST" class="form-inline justify-content-center">
            <input type="text" name="query" class="form-control" placeholder="Search..." required>
            <button type="submit" class="btn btn-primary ml-2">Search</button>
        </form>

        {% if results %}
        <div class="mt-4">
            <h3>Search Results</h3>
            <div class="list-group">
                {% for score, title, url in results %}
                <div class="result">
                    <h4><a href="{{ url }}" target="_blank">{{ title }}</a></h4>
                    <p class="score">Score: {{ score }}</p>
                    <p><a href="{{ url }}" target="_blank">{{ url }}</a></p>
                </div>
                {% endfor %}
            </div>
        </div>
        {% endif %}
        
        {% if not results %}
        <div class="mt-4">
            <p>No results found. Please try another search.</p>
        </div>
        {% endif %}
    </div>

    <script src="https://code.jquery.com/jquery-3.5.1.slim.min.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/@popperjs/core@2.9.3/dist/umd/popper.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/bootstrap/4.5.2/js/bootstrap.min.js"></script>
</body>
</html>