Migrating to Hoarder
I have been on a mission recently to regain control of my data. I haven't yet faced the humongous task of moving my main email away from Gmail, but I have had some successes with other cloud services, and a win is a win…
One of those wins is my bookmark manager. I have used Pocket since way back in the day, when it was originally called “Read It Later”. I wanted to move it away from a cloud service and host it locally, and it was also an opportunity to add a little AI magic to go with it.
What is Hoarder?
Taken from the Hoarder website, it's an app that can do all of this:
- 🔗 Bookmark links, take simple notes and store images and pdfs.
- ⬇️ Automatic fetching for link titles, descriptions and images.
- 📋 Sort your bookmarks into lists.
- 🔎 Full text search of all the content stored.
- ✨ AI-based (aka ChatGPT) automatic tagging, with support for local models using Ollama!
- 🎆 OCR for extracting text from images.
- 🔖 Chrome plugin and Firefox addon for quick bookmarking.
- 📱 An iOS app, and an Android app.
- 📰 Auto hoarding from RSS feeds.
- 🔌 REST API.
- 🌐 Multi-language support.
- 🖍️ Mark and store highlights from your hoarded content.
- 🗄️ Full page archival (using monolith) to protect against link rot. Auto video archiving using youtube-dl.
- ☑️ Bulk actions support.
- 🔐 SSO support.
- 🌙 Dark mode support.
- 💾 Self-hosting first.
The main use for me is having a website and mobile app for saving useful pages that I find when I'm down a tech rabbit hole.
Hoarder Architecture
I have chosen to run this within my home infrastructure and connect it to my existing Ollama setup. This means that Hoarder can call Ollama for AI text/image tagging using the setup and models I have already created.
The deployment is all within Docker and I have added extracts from my Docker Compose files below.
Hoarder Install
Snippet from my docker-compose.yml
```yaml
# Hoarder
hoarder:
  image: ghcr.io/hoarder-app/hoarder:${HOARDER_VERSION:-release}
  restart: unless-stopped
  networks:
    - traefik
  volumes:
    - ./data:/data
  env_file:
    - .env
  environment:
    MEILI_ADDR: http://meilisearch:7700
    BROWSER_WEB_URL: http://chrome:9222
    DATA_DIR: /data
  labels:
    - "com.example.description=hoarder"
    - "traefik.enable=true"
    - "traefik.http.routers.hoarder.rule=Host(`hoarder.jameskilby.cloud`)"
    - "traefik.http.routers.hoarder.entrypoints=https"
    - "traefik.http.routers.hoarder.tls=true"
    - "traefik.http.routers.hoarder.tls.certresolver=cloudflare"
    - "traefik.http.services.hoarder.loadbalancer.server.port=3000"

chrome:
  image: gcr.io/zenika-hub/alpine-chrome:123
  restart: unless-stopped
  networks:
    - traefik
  command:
    - --no-sandbox
    - --disable-gpu
    - --disable-dev-shm-usage
    - --remote-debugging-address=0.0.0.0
    - --remote-debugging-port=9222
    - --hide-scrollbars

meilisearch:
  image: getmeili/meilisearch:v1.11.1
  restart: unless-stopped
  networks:
    - traefik
  env_file:
    - .env
  environment:
    MEILI_NO_ANALYTICS: "true"
  volumes:
    - ./meilisearch:/meili_data
```
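If you want to stand the stack up from this snippet, something like the following should work. The traefik network isn't defined in the extract above, so I'm assuming it is created outside of this compose file (it is the network my reverse proxy already sits on).

```bash
# Create the shared network if it doesn't already exist
# (assumption: it is defined externally rather than in this compose file):
docker network create traefik

# Start Hoarder, headless Chrome and Meilisearch in the background:
docker compose up -d

# Follow the Hoarder logs while everything initialises:
docker compose logs -f hoarder
```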
Snippet from my .env file
```
OLLAMA_BASE_URL=http://ollama:11434
INFERENCE_TEXT_MODEL=llama3.1:8b
INFERENCE_IMAGE_MODEL=llava
```
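These hostnames only resolve on the shared Docker network, and the two models need to exist in Ollama before tagging will work. A quick sanity check, assuming the Ollama container is called ollama and sits on the same traefik network:

```bash
# List the models Ollama currently has. This runs curl from a throwaway container
# on the same network, since the "ollama" hostname doesn't resolve from the host:
docker run --rm --network traefik curlimages/curl -s http://ollama:11434/api/tags

# Pull the text and image models referenced in the .env snippet, if they are missing:
docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama pull llava
```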
Export from Pocket
Luckily Pocket has an export function that will dump all of your saved URLs and tags into a single file. This can be done by navigating to https://getpocket.com/export while logged in.
Import to Hoarder
Once you have this file you can import it straight into Hoarder. This is done by navigating to the user settings section and then selecting Import/Export. Hoarder supports a number of file formats from other tools.
Hoarder In Action
When the URLs are loaded, Hoarder passes each one to a headless Chrome instance to gather the data from that page. It then indexes the contents and passes them to Ollama to apply appropriate tags.
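Both supporting services can be poked at directly if you want to check they are behaving while an import runs. The ports are the ones from the compose snippet, and as before the hostnames only resolve on the Docker network:

```bash
# The headless Chrome container exposes the DevTools protocol on port 9222:
docker run --rm --network traefik curlimages/curl -s http://chrome:9222/json/version

# Meilisearch, which handles the full-text indexing, has a health endpoint on port 7700:
docker run --rm --network traefik curlimages/curl -s http://meilisearch:7700/health
```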
AI Prompt
You can tweak the AI prompt that is sent over to Ollama. In my case I have just used the default, as it looked like a good starting point. The prompt is:
```
You are a bot in a read-it-later app and your responsibility is to help with automatic tagging.
Please analyze the text between the sentences "CONTENT START HERE" and "CONTENT END HERE" and suggest relevant tags that describe its key themes, topics, and main ideas. The rules are:
- Aim for a variety of tags, including broad categories, specific keywords, and potential sub-genres.
- The tags language must be in english.
- If it's a famous website you may also include a tag for the website. If the tag is not generic enough, don't include it.
- The content can include text for cookie consent and privacy policy, ignore those while tagging.
- Aim for 3-5 tags.
- If there are no good tags, leave the array empty.
CONTENT START HERE
<CONTENT_HERE>
CONTENT END HERE
You must respond in JSON with the key "tags" and the value is an array of string tags.
```
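For reference, a response that satisfies that last instruction would look something like this (the tags here are just invented for illustration):

```json
{
  "tags": ["self-hosting", "docker", "bookmarks"]
}
```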
The process of gathering all the web pages, indexing them, and analyzing them with AI took around an hour with my setup. I believe that Hoarder has some internal throttles to try and avoid tripping anti-bot tools.
You can monitor this process in the admin section. I was also keen to see the stats on my graphics card, so I ran nvtop on the VM while some of the AI processing was running.
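nvtop isn't part of Hoarder; it's just a handy top-style viewer for GPU utilisation. On a Debian/Ubuntu-based VM it can be installed straight from the standard repositories:

```bash
# Install and run nvtop to watch the GPU while Ollama does the tagging:
sudo apt install nvtop
nvtop
```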
Finished Result
At the end of the process I ended up with a little over 750 bookmarks imported, fully indexed, and displayed in a very appealing way.
Post Install Tasks
Tag Merge
The AI process is mightily impressive, but one issue that can occur is that it generates similar tags. Hoarder gives you the ability to merge them.
This is done in the Cleanups section under your username. In my case it suggested 150 tags that may need merging. At the moment I am glad this has manual oversight, as some should definitely be merged, like “Motherboard” and “Motherboards”, but some are totally different, like “NAS” and “NASA”.
Broken Links
I was quite surprised by how many of my saved links were broken. This was for a few reasons, ranging from company takeovers to sites simply being dead; sadly, in one case the author is no longer with us. In a few cases the broken link was because the URL I had saved led to a URL shortener that no longer exists. I need to revisit the remaining links in the list and ensure that they point to their final destination.