{"id":8100,"date":"2026-01-22T05:02:00","date_gmt":"2026-01-21T20:02:00","guid":{"rendered":"https:\/\/devneko.jp\/wordpress\/?p=8100"},"modified":"2026-01-18T14:06:22","modified_gmt":"2026-01-18T05:06:22","slug":"speech-hands-a-self-reflection-voice-agentic-approach-to-speech-recognition-and-audio-reasoning-with-omni-perception","status":"publish","type":"post","link":"https:\/\/devneko.jp\/wordpress\/?p=8100","title":{"rendered":"Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception"},"content":{"rendered":"\n<ul class=\"wp-block-list\">\n<li><strong>Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception\u00a0<\/strong>[142.5]<br>We introduce a speech recognition framework that learns a single coherent skill: knowing when to trust its own perception and when to consult external audio perception. Naively fine-tuning an omni model on both speech recognition and external audio understanding tasks often degrades performance. 
To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable primitive is effective at preventing the model from being derailed by flawed external candidates.<br><a href=\"http:\/\/arxiv.org\/abs\/2601.09413v1\">Paper<\/a>\u00a0\u00a0<a href=\"https:\/\/fugumt.com\/fugumt\/paper_check\/2601.09413v1\">Reference translation (metadata)<\/a>\u00a0 \u00a0(Wed, 14 Jan 2026 12:06:50 GMT)<\/li>\n\n\n\n<li>\u300cIn this work, we proposed a learnable voice-agentic framework Speech-Hands for teaching omni models when to trust itself versus when to consult external audio perception. 
By casting the problem with explicit &lt;internal>, &lt;external>, and &lt;rewrite> action tokens, our experimental results across AudioQA and ASR benchmarks demonstrate strong performance improvements beyond strong baselines, especially when direct finetuning and GER training fail, Speech-Hands can still robustly generate the best prediction.\u300d is the authors' summary. The stated motivation is \u300cWe aim to instill a form of computational self-reflection (Nelson, 1990) into an omni-modal agent, designing a collaborative framework that explicitly reasons about when to trust its own perception, when to defer to an expert, and even when to utilize tools\u300d.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[714,713],"class_list":["post-8100","post","type-post","status-publish","format-standard","hentry","category-arxiv","tag-speech","tag-voice"],"_links":{"self":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/8100","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=8100"}],"version-history":[{"count":2,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/8100\/revisions"}],"predecessor-version":[{"id":8102,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/8100\/revisions\/8102"}],"wp:attachment":[{"href":"https:\/\/devneko.jp\/word
press\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=8100"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=8100"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=8100"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}