Llama 2 vs GPT-4: Output Quality Issues

Has anyone else noticed a significant gap in output quality between open-source LLMs like Llama 2 70B and proprietary models like GPT-4 Turbo? It shows up especially when generating detailed user stories with accurate task breakdowns: with the same prompts, I'm seeing roughly 30% lower success rates from the open-source models.


2 Replies

Lisa M. @lisa-m · 21h ago

Absolutely, we're seeing similar discrepancies. Llama 2 struggled to reliably generate the 5-7 detailed user stories our Jira tickets need, whereas GPT-4 Turbo handled it consistently.
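A "success rate" comparison like this is easier to reproduce if each output is scored automatically. The following is a minimal, hypothetical checker. It assumes that stories follow the common "As a <role>, I want <goal> so that <benefit>" template and that tasks are indented bullets under each story. Neither format was specified in this thread. The 5-7 story range comes from the Jira requirement above. The function names are illustrative only.

```python
import re

# Assumed story template (illustrative): optional "1." / "1)" prefix, then
# "As a <role>, I want <goal> so that <benefit>".
STORY_RE = re.compile(r"^\s*(?:\d+[.)]\s*)?As an? .+?, I want .+? so that .+", re.IGNORECASE)
# Assumed task format (illustrative): an indented "-" or "*" bullet under a story.
TASK_RE = re.compile(r"^\s+[-*]\s+\S")

MIN_STORIES, MAX_STORIES = 5, 7  # range taken from the Jira requirement in this thread


def parse_stories(text: str) -> list[int]:
    """Return the number of task bullets found under each detected user story."""
    task_counts: list[int] = []
    for line in text.splitlines():
        if STORY_RE.match(line):
            task_counts.append(0)  # a new story starts with zero tasks
        elif TASK_RE.match(line) and task_counts:
            task_counts[-1] += 1  # attribute the task to the most recent story
    return task_counts


def is_success(text: str) -> bool:
    """Pass if the output has 5-7 stories and every story has at least one task."""
    counts = parse_stories(text)
    return MIN_STORIES <= len(counts) <= MAX_STORIES and all(c > 0 for c in counts)


def success_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that pass is_success()."""
    if not outputs:
        return 0.0
    return sum(is_success(o) for o in outputs) / len(outputs)


def relative_drop(baseline: float, candidate: float) -> float:
    """Relative decline of candidate vs. baseline (0.9 -> 0.6 gives about 0.33)."""
    if baseline == 0:
        raise ValueError("baseline success rate must be non-zero")
    return (baseline - candidate) / baseline
```

Run the same prompt set through both models and compare their `success_rate()` values. That keeps the comparison like-for-like. `relative_drop()` separates a relative decline from an absolute gap. Going from 0.9 to 0.6 is a 30-percentage-point gap but roughly a 33% relative drop, so a figure like "~30% lower" is worth pinning down either way.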

Emma Chen @emma-c · 9h ago

That’s fascinating – I’ve found that Midjourney consistently produces far more structured, detailed breakdowns when prompted with similar user story requests, suggesting the prompt engineering itself is a bigger factor than the model’s raw capability.
