Tag-AI supports two AI models for generating image tags: local processing with LLaVA and cloud processing with Google's Gemini API.
Each model has distinct advantages and ideal use cases. You can switch between them based on your needs.
LLaVA (Large Language and Vision Assistant) is a local, privacy-focused multimodal model that runs on your own computer.
LLaVA typically generates 15-20 tags per image, focusing on:
mountain, landscape, snow, trees, forest, sky, cloudy, nature, outdoor, scenery, wilderness, tranquil, valley, peak, rocky, winter, evergreen, alpine, daylight
Google's Gemini API is a cloud-based vision model offering high-quality image analysis.
Gemini typically generates 30-50 tags per image, with more specific identification of:
mountain, alpine, snow-capped, evergreen trees, coniferous forest, valley, clouds, overcast, dramatic landscape, wilderness, nature, outdoors, hiking destination, mountain range, rocky terrain, alpine meadow, mountain trail, scenic vista, photography, tranquil scene, panoramic view, natural beauty, backpacking, mountaineering, pristine, conservation, national park, ecological diversity, winter scene, environmental photography
| Feature | Local (LLaVA) | Cloud (Gemini) |
|---|---|---|
| Privacy | Excellent (fully local) | Limited (cloud processing) |
| Tag Quality | Good | Excellent |
| Tag Quantity | 15-20 tags per image | 30-50 tags per image |
| Processing Speed | Depends on hardware | Depends on internet connection |
| Resource Usage | High (local CPU/GPU) | Low (cloud-based) |
| Internet Required | Only for setup | Always |
| Cost | Free (after Tag-AI purchase) | Free tier or paid API subscription |
| Rate Limits | None (hardware-dependent) | Yes (API quota) |
To switch between tagging models, select one of the following options:

- `local`: use LLaVA local processing
- `gemini`: use Google's Gemini API

If selecting Gemini, ensure you've configured your API key in the `[tagger_gemini]` section.
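A minimal configuration sketch follows. Apart from the `[tagger_gemini]` section named above, the section and key names here are illustrative assumptions, not confirmed Tag-AI settings:

```ini
[tagger]
; hypothetical key name; check the actual Tag-AI configuration reference
model = gemini

[tagger_gemini]
; required when the Gemini model is selected
api_key = YOUR_GEMINI_API_KEY
```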
Before sending images to either model, Tag-AI:
Tag-AI uses a specific prompt for each model.

LLaVA prompt:

```
Please analyze the provided image and follow these steps:
1. Look at the image.
2. List a minimum of 15 distinct tags that capture specific attributes of the image.
3. Return the tags in a single line separated by a comma and a space.
```

Gemini prompt:

```
Generate comma-separated detailed tags for this image, describing subject, context, and visual details.
```
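As a rough sketch of how a prompt like this could be sent to a local LLaVA model, the payload below targets Ollama's `/api/generate` HTTP endpoint (a documented Ollama interface; this is not Tag-AI's actual internal code):

```python
import base64

LLAVA_PROMPT = (
    "Please analyze the provided image and follow these steps:\n"
    "1. Look at the image.\n"
    "2. List a minimum of 15 distinct tags that capture specific attributes of the image.\n"
    "3. Return the tags in a single line separated by a comma and a space."
)

def build_ollama_request(image_bytes: bytes, model: str = "llava") -> dict:
    """Build the JSON payload for a POST to http://localhost:11434/api/generate."""
    return {
        "model": model,
        "prompt": LLAVA_PROMPT,
        # Ollama expects images as base64-encoded strings
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        # return one complete response instead of a token stream
        "stream": False,
    }
```

Posting this payload (with real image bytes) to a running Ollama instance returns the model's tag string in the response's `response` field.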
After receiving tags from either model, Tag-AI:
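Both models return a single comma-separated string, so post-processing amounts to splitting and normalizing it. A minimal sketch (the exact cleanup Tag-AI performs is not specified here) might look like:

```python
def clean_tags(raw: str) -> list[str]:
    """Split a comma-separated tag string, then lowercase, trim, and
    de-duplicate the tags while preserving their original order."""
    seen = set()
    tags = []
    for part in raw.split(","):
        tag = part.strip().lower()
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags

print(clean_tags("Mountain, snow,  Trees, mountain, "))
# → ['mountain', 'snow', 'trees']
```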
Advanced users can modify the configuration to use a different local model, for example:

```
ollama pull bakllava
ollama pull llava-v1.6-34b
```
Only multimodal vision-language models will work. Pure language models without vision capabilities will fail.
Other models can be found in the Ollama model library.
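After pulling an alternative model, point the configuration at it. The section and key names below are assumptions for illustration only; consult the actual Tag-AI configuration reference for the real option names:

```ini
[tagger_local]
; hypothetical section/key; must name a multimodal model already pulled via ollama
model = bakllava
```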