Adding RSS feeds via the crawler API

I’ve been trying to mass-add RSS feeds to YaCy via the API, with no luck yet.

I did get a web crawl to start after plugging the example from https://yacy.net/api/crawler/ into ChatGPT.

I changed “range” to “domain”; it had been set to “wide”.

curl -G "http://localhost:8090/Crawler_p.html" \
  --data-urlencode "crawlingDomMaxPages=10000" \
  --data-urlencode "range=domain" \
  --data-urlencode "intention=" \
  --data-urlencode "sitemapURL=" \
  --data-urlencode "crawlingQ=on" \
  --data-urlencode "crawlingMode=url" \
  --data-urlencode "crawlingURL=https://community.searchlab.eu/" \
  --data-urlencode "crawlingFile=" \
  --data-urlencode "mustnotmatch=" \
  --data-urlencode "crawlingFile$file=" \
  --data-urlencode "crawlingstart=Neuen Crawl starten" \
  --data-urlencode "mustmatch=.*" \
  --data-urlencode "createBookmark=on" \
  --data-urlencode "bookmarkFolder=/crawlStart" \
  --data-urlencode "xsstopw=on" \
  --data-urlencode "indexMedia=on" \
  --data-urlencode "crawlingIfOlderUnit=hour" \
  --data-urlencode "cachePolicy=iffresh" \
  --data-urlencode "indexText=on" \
  --data-urlencode "crawlingIfOlderCheck=on" \
  --data-urlencode "bookmarkTitle=" \
  --data-urlencode "crawlingDomFilterDepth=1" \
  --data-urlencode "crawlingDomFilterCheck=on" \
  --data-urlencode "crawlingIfOlderNumber=1" \
  --data-urlencode "crawlingDepth=4"

Here’s a way to sanity-check a list of URLs before crawling: fetch each one and sort it into valid_urls.txt or invalid_urls.txt based on its HTTP status code.

cat feeds.csv | xargs -n 1 -P 10 -I {} bash -c 'code=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$1"); if [ "$code" -ge 200 ] && [ "$code" -lt 400 ]; then echo "$1" >> valid_urls.txt; else echo "$1" >> invalid_urls.txt; fi' _ {}
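
Once the list is cleaned up, the crawl call that worked above can be looped over valid_urls.txt to mass-add everything. A minimal sketch, trimmed to the key parameters (add back the others from the full command as needed):

while IFS= read -r url; do
  # same Crawler_p.html call as above, one crawl start per feed URL
  curl -sG -o /dev/null "http://localhost:8090/Crawler_p.html" \
    --data-urlencode "crawlingMode=url" \
    --data-urlencode "crawlingURL=$url" \
    --data-urlencode "range=domain" \
    --data-urlencode "crawlingDepth=4" \
    --data-urlencode "crawlingstart=Neuen Crawl starten"
  sleep 2   # brief pause so the crawler queue isn't hit with everything at once
done < valid_urls.txt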

What does “with no luck yet” mean? What’s the problem there?

The request shows up in the log, but the crawl won’t actually start.
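
One thing I still need to rule out: Crawler_p.html is an admin page, and depending on your settings YaCy protects admin pages with HTTP digest authentication, so an unauthenticated request can show up in the log without the crawl ever being triggered. The same call with digest credentials would look like this (admin:yourpassword is a placeholder for the real admin account):

curl --digest -u "admin:yourpassword" -G "http://localhost:8090/Crawler_p.html" \
  --data-urlencode "crawlingMode=url" \
  --data-urlencode "crawlingURL=https://community.searchlab.eu/" \
  --data-urlencode "range=domain" \
  --data-urlencode "crawlingDepth=4" \
  --data-urlencode "crawlingstart=Neuen Crawl starten"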